Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

arXiv cs.CL Papers

Summary

This paper introduces Program-based Posterior Training (PPT), a method that uses LLM-generated probabilistic programs to create distributional targets for fine-tuning inductive reasoning, improving estimation accuracy and calibration on held-out tasks and human-alignment benchmarks.

arXiv:2606.09856v1 Announce Type: new Abstract: Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.


# Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models
Source: [https://arxiv.org/html/2606.09856](https://arxiv.org/html/2606.09856)
Liyi Zhang (Department of Computer Science, Princeton University, zhang.liyi@princeton.edu), Akshay K. Jagadish (Princeton AI Lab, Princeton University, akshay.jagadish@princeton.edu), Brenden M. Lake (Department of Psychology and Computer Science, Princeton University, brenden@princeton.edu), Thomas L. Griffiths (Department of Psychology and Computer Science, Princeton University, tomg@princeton.edu)

###### Abstract

Post\-training Large Language Models \(LLMs\) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable\. Yet, many real\-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations\. There are challenges to using standard fine\-tuning methods for inductive reasoning, including difficulties in curating large\-scale, high\-quality labeled datasets and in handling targets that are inherently distributional\. In this work, we introduce a novel approach, called Program\-based Posterior Training \(PPT\), to address these limitations: we use an LLM to generate diverse open\-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine\-tune on these probabilistic soft labels\. Using this approach, we fine\-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held\-out motifs, human\-labeled judgments, and external benchmarks\. Overall, PPT substantially improves estimation accuracy on held\-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration\. Additionally, the gains in raw calibration are not subsumed by post\-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling\. Together, these results suggest that probabilistic\-program\-mediated fine\-tuning is a promising approach for post\-training LLMs to reliably perform approximate inductive inference\.

## 1 Introduction

Large language models (LLMs) have made rapid progress on reasoning tasks, especially in domains such as mathematics and coding (Wei et al., [2022](https://arxiv.org/html/2606.09856#bib.bib6); Chen et al., [2021](https://arxiv.org/html/2606.09856#bib.bib2)). These domains largely involve deductive problems: the model is expected to derive a correct answer through a sequence of logical, deterministic intermediate steps. However, many real-world reasoning problems are inductive rather than deductive (Griffiths et al., [2008](https://arxiv.org/html/2606.09856#bib.bib22); Tenenbaum et al., [2011](https://arxiv.org/html/2606.09856#bib.bib17)). People routinely infer broader patterns from limited evidence and make predictions in situations where the relevant latent factors are only partially observed (Griffiths et al., [2008](https://arxiv.org/html/2606.09856#bib.bib22)). Scientific reasoning, forecasting, diagnosis, and everyday decision-making all require inferring latent quantities from sparse, noisy observations and making calibrated predictions (Blei, [2014](https://arxiv.org/html/2606.09856#bib.bib20); Bishop, [2006](https://arxiv.org/html/2606.09856#bib.bib21)).

Training LLMs to perform inductive inference is challenging because high-quality labeled data is difficult to obtain and the underlying problem is often inherently probabilistic, so answers must represent uncertainty. Human judgments can be collected, but they are expensive and hard to scale across the diverse open-world scenarios where induction is needed. Directly asking stronger LLMs to label such scenarios is scalable ([Jagadish et al.](https://arxiv.org/html/2606.09856#bib.bib33); Jagadish et al., [2025](https://arxiv.org/html/2606.09856#bib.bib32)), but the labels can inherit those models' biases and errors (Chhikara, [2025](https://arxiv.org/html/2606.09856#bib.bib8)). In addition, real-world probabilistic reasoning tasks often lack known latent variables or calibrated posterior targets. To better train LLMs to perform inductive inference, we need natural-language problems that are rich enough to resemble open-world inductive reasoning but have labels that are grounded in a well-defined probabilistic model.

Here, we address this gap by developing a method for curating large-scale natural-language inductive inference data. Probabilistic programs can provide a bridge between open-world natural language and precise assumptions and supervision. LLMs are useful for proposing diverse scenarios and translating them into structured generative models (Rmus et al., [2025](https://arxiv.org/html/2606.09856#bib.bib3)), while probabilistic inference supplies training targets that are more grounded than direct LLM labels. We introduce a pipeline for scalable generation of inductive reasoning datasets for fine-tuning LLMs (Figure [1](https://arxiv.org/html/2606.09856#S1.F1)). Inspired by the Model Synthesis Architecture (MSA) (Wong et al., [2025](https://arxiv.org/html/2606.09856#bib.bib1)), our approach generates fine-tuning data from diverse open-world problems with a sequential prompting procedure: (1) synthesizing scenarios with a strong closed-source LLM; (2) generating probabilistic programs that capture these scenarios with an LLM; and (3) running probabilistic inference (or forward simulation) on these programs to calculate posterior distributions over relevant variables. We then use those posterior distributions as targets for fine-tuning. With this pipeline, we create over 10,000 unique open-world scenarios with over 50,000 queries and fine-tune LLMs on the resulting data. We then evaluate the resulting models on held-out inductive tasks, compare with human judgments on sports domains, and measure accuracy and calibration on external benchmarks.

![Refer to caption](https://arxiv.org/html/2606.09856v1/x1.png)

Figure 1: Workflow from data generation to LLM fine-tuning. A sequential prompting process with three steps results in probabilistic programs in Pyro for synthesized natural-language scenarios. These programs are run to generate (scenario, query, posterior answer) tuples. These tuples are used to fine-tune LLMs. The plot in the bottom right shows that the programs produced by the pipeline give results aligned with the open-source programs generated by MSA on the same scenarios.

Our approach allows us to train models on natural-language inductive reasoning problems whose latent structure and uncertainty are explicitly represented. We find that this approach improves estimation accuracy on held-out inferential structures, increases agreement with human judgments, improves performance on several transfer benchmarks, and yields better raw calibration while remaining complementary to post-hoc temperature scaling. These results suggest that data curation with probabilistic program mediation is a promising direction for training LLMs to approximate inductive inference.

In summary, we make the following contributions:

- We study whether large language models can be fine-tuned to improve on *inductive reasoning* (inferring uncertain latent properties and predictions from sparse natural-language observations), as opposed to deterministic, deductive reasoning.
- We create a dataset with over 10,000 natural-language inductive inference scenarios with probabilistically grounded supervision. To do so, we use a scalable extension of MSA that generates open-world scenarios, synthesizes probabilistic programs for these scenarios, and conducts probabilistic inference to produce targets for fine-tuning.
- We fine-tune LLMs on the resulting data and evaluate on held-out motifs and human judgments, showing improved estimation accuracy and stronger agreement with human judgments on unseen inferential structures.
- We demonstrate the utility of these principled probabilistic labels by showing that fine-tuning on distributional probabilistic labels outperforms (1) directly querying a stronger, closed-source LLM and (2) fine-tuning only on the mean of the same distribution.
- We evaluate transfer and calibration on seven benchmarks, showing improvements on OpenEstimate (Marzoev et al., [2026](https://arxiv.org/html/2606.09856#bib.bib15)) and six multiple-choice benchmarks (Sakaguchi et al., [2021](https://arxiv.org/html/2606.09856#bib.bib29); Hendrycks et al., [2021](https://arxiv.org/html/2606.09856#bib.bib30); Lin et al., [2022](https://arxiv.org/html/2606.09856#bib.bib31); Zellers et al., [2019](https://arxiv.org/html/2606.09856#bib.bib34); Clark et al., [2018](https://arxiv.org/html/2606.09856#bib.bib35); Qiu et al., [2026](https://arxiv.org/html/2606.09856#bib.bib14)), as well as improved calibration that is complementary to post-hoc methods such as temperature scaling (Guo et al., [2017](https://arxiv.org/html/2606.09856#bib.bib24)).

## 2 Related Work

Inductive reasoning in LLMs. Inductive reasoning has long been studied as probabilistic inference over structured generative models (Griffiths et al., [2008](https://arxiv.org/html/2606.09856#bib.bib22); Lake et al., [2015](https://arxiv.org/html/2606.09856#bib.bib11)), and a growing body of work asks whether LLMs can perform such inferences. Wong et al. ([2023](https://arxiv.org/html/2606.09856#bib.bib16)) explore rational meaning construction, in which an LLM translates natural language into probabilistic programs in the language-of-thought tradition (Tenenbaum et al., [2011](https://arxiv.org/html/2606.09856#bib.bib17); Goodman et al., [2014](https://arxiv.org/html/2606.09856#bib.bib18)). Wang et al. ([2024](https://arxiv.org/html/2606.09856#bib.bib9)) formulate inductive reasoning as hypothesis search, in which an LLM proposes natural-language hypotheses and translates them into deterministic Python programs that are verified against examples. Yang et al. ([2024b](https://arxiv.org/html/2606.09856#bib.bib10)) evaluate LLMs as inductive reasoners over natural-language facts and rules.

Probabilistic programs and LLMs. Several recent works combine LLMs with probabilistic programs at inference time. Wu and Goodman ([2022](https://arxiv.org/html/2606.09856#bib.bib12)) train neural amortized inference networks for probabilistic programs. The Model Synthesis Architecture (MSA) proposed by Wong et al. ([2025](https://arxiv.org/html/2606.09856#bib.bib1)) addresses open-world inductive reasoning by using an LLM to synthesize a task-specific probabilistic model written in WebPPL (Goodman and Stuhlmüller, [2014](https://arxiv.org/html/2606.09856#bib.bib28)) from a natural-language scenario. Given sparse observations $\bm{x}$, the goal is to answer a query $\bm{y}$ by inferring latent variables $\bm{z}$. A probabilistic model specifies the distribution

$$p_{\theta}(\bm{z},\bm{x},\bm{y}) = p_{\theta}(\bm{z})\,p_{\theta}(\bm{x}\mid\bm{z})\,p_{\theta}(\bm{y}\mid\bm{z},\bm{x}), \tag{1}$$

and inference requires computing posterior quantities such as

$$p_{\theta}(\bm{z}\mid\bm{x}) \propto p_{\theta}(\bm{z})\,p_{\theta}(\bm{x}\mid\bm{z}), \tag{2}$$

$$p_{\theta}(\bm{y}\mid\bm{x}) = \int p_{\theta}(\bm{y}\mid\bm{z},\bm{x})\,p_{\theta}(\bm{z}\mid\bm{x})\,d\bm{z}. \tag{3}$$

Given a scenario, a series of prompts leads an LLM to represent it with a probabilistic program. The LLM first identifies relevant latent factors, observations, and query variables, then writes a probabilistic program in WebPPL encoding their generative relationships. The probabilistic program is run to obtain posterior estimates over $\bm{z}$ and $\bm{y}$. Thus, the LLM supplies open-world knowledge and the probabilistic program formalizes the precise meaning, facilitating coherent posterior inference.
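To make Equations (2) and (3) concrete, the following is a minimal sketch on a discrete grid. It is not taken from MSA or from the paper's programs: the grid of candidate win probabilities, the uniform prior, the Bernoulli likelihood, and the function names are all illustrative assumptions, and the query here depends only on $\bm{z}$ rather than on both $\bm{z}$ and $\bm{x}$.

```python
# Toy illustration of Equations (2) and (3) on a discrete grid.
# The grid, prior, and likelihood are illustrative assumptions.

def posterior_over_latent(observations, grid):
    """Eq. (2): p(z | x) is proportional to p(z) p(x | z), with a uniform prior."""
    prior = 1.0 / len(grid)
    unnormalized = []
    for z in grid:
        likelihood = 1.0
        for won in observations:
            # Each observation is a win (True) or loss (False) with P(win) = z.
            likelihood *= z if won else (1.0 - z)
        unnormalized.append(prior * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def posterior_predictive(observations, grid):
    """Eq. (3) as a sum: p(y | x) = sum_z p(y | z) p(z | x), y = 'wins next match'."""
    post = posterior_over_latent(observations, grid)
    return sum(z * p for z, p in zip(grid, post))


grid = [i / 10 for i in range(1, 10)]  # candidate win probabilities 0.1 ... 0.9
observations = [True, True, False]     # two wins, one loss
p_win_next = posterior_predictive(observations, grid)
```

With no observations the predictive probability equals the prior mean of the grid (0.5); two wins and one loss shift it above 0.5, which is the kind of graded update the posterior targets are meant to encode.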

Fine-tuning LLMs with probabilistic models. A line of work suggests that autoregressive modeling can support implicit Bayesian inference (Xie et al., [2022](https://arxiv.org/html/2606.09856#bib.bib4); Zhang et al., [2025](https://arxiv.org/html/2606.09856#bib.bib5)), motivating the use of probabilistic models as training signals for LLMs. Recent work has begun to explore this direction. Hu et al. ([2024](https://arxiv.org/html/2606.09856#bib.bib13)) fine-tune LLMs via GFlowNets to sample from intractable posteriors over latent chains of thought. Hollmann et al. ([2025](https://arxiv.org/html/2606.09856#bib.bib19)) use probabilistic models to train transformers to fill in missing values in tabular data. Qiu et al. ([2026](https://arxiv.org/html/2606.09856#bib.bib14)) fine-tune LLMs to mimic a hand-coded Bayesian assistant in a flight-recommendation task and demonstrate transfer to structurally similar recommendation tasks. Their supervision target is the assistant's point recommendation at each round. Our approach extends this significantly: it synthesizes a distinct probabilistic program for each open-world scenario, uses the full posterior as a soft label, and evaluates transfer across heterogeneous inferential motifs and external benchmarks.

Uncertainty calibration in LLMs. Calibration of LLM outputs has been studied through both prompting and post-hoc adjustment. Chhikara ([2025](https://arxiv.org/html/2606.09856#bib.bib8)) documents persistent overconfidence and distractor effects in current LLMs. Tian et al. ([2023](https://arxiv.org/html/2606.09856#bib.bib25)) establish that models can be prompted to produce well-calibrated verbalized confidences; Shen et al. ([2024](https://arxiv.org/html/2606.09856#bib.bib23)) propose Thermometer, a post-hoc method that improves calibration across tasks. Post-hoc temperature scaling (Guo et al., [2017](https://arxiv.org/html/2606.09856#bib.bib24)) remains the standard reference baseline. These methods supervise or adjust meta-confidence over an existing answer distribution. We take a different approach: rather than tuning meta-confidence, we supervise the answer distribution with the Bayesian posterior. The resulting calibration improvements compose with temperature scaling, which suggests that our approach shifts the model's raw uncertainty representation rather than rescaling its outputs.
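For readers unfamiliar with the temperature-scaling baseline, the sketch below shows the basic idea: a single scalar $T$ divides the logits and is chosen to minimize negative log-likelihood on validation data. This is only an illustration under stated assumptions: the toy logits and labels are made up, and a simple grid search stands in for the gradient-based optimizer that is typically used in practice.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(all_logits, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(lg, temperature)[y])
              for lg, y in zip(all_logits, labels)]
    return sum(losses) / len(losses)


def fit_temperature(all_logits, labels, candidates):
    """Grid search (an assumption of this sketch) for the NLL-minimizing temperature."""
    return min(candidates, key=lambda t: mean_nll(all_logits, labels, t))


# Hypothetical overconfident model: very peaked logits, but one of four is wrong.
val_logits = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 8.0], [8.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]
candidates = [0.5 + 0.1 * i for i in range(96)]  # 0.5, 0.6, ..., 10.0
best_t = fit_temperature(val_logits, val_labels, candidates)
```

Because temperature scaling only rescales existing logits, it cannot change which answer is ranked first; this is the sense in which it adjusts outputs rather than the underlying answer distribution.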

## 3 Program-based Posterior Training

![Refer to caption](https://arxiv.org/html/2606.09856v1/x2.png) (a) Scenario and query text.
![Refer to caption](https://arxiv.org/html/2606.09856v1/x3.png) (b) LLM posteriors and program-inferred posteriors (ground truth).

Figure 2: Effects of Program-based Posterior Training. (a) Example held-out scenario and queries. (b) Posteriors over answer-token probabilities from the base LLM and the LLM fine-tuned on data from probabilistic programs (Llama3-8B-Instruct). The blue dashed line is the programmatic inference result, treated as ground truth. Text prepended to the scenario (see Appendix Section [A.1](https://arxiv.org/html/2606.09856#A1.SS1)) instructs the LLM that Q3 and Q4 are estimated on a 0-100 scale, where a lower number means Team 1 is more likely to win. The fine-tuned LLM's posteriors are more similar to those of the program.

Our goal is to construct natural-language inductive inference data with probabilistically grounded supervision, and then use this data to fine-tune LLMs. To do so, we propose the Program-based Posterior Training (PPT) pipeline, illustrated in Figure [1](https://arxiv.org/html/2606.09856#S1.F1). The data-generation stage proceeds in three steps. First, an LLM generates diverse natural-language scenarios and queries. Second, an LLM translates each scenario into a probabilistic program in Pyro (Bingham et al., [2019](https://arxiv.org/html/2606.09856#bib.bib40)). Third, probabilistic inference or forward simulation is run on the resulting program to produce supervised training targets. We then fine-tune open-source LLMs on these scenario–query–answer examples.

### 3.1 Data generation

Our data-generation pipeline (Figure [1](https://arxiv.org/html/2606.09856#S1.F1)) builds on MSA (Wong et al., [2025](https://arxiv.org/html/2606.09856#bib.bib1)), although it uses MSA as a scalable data curation mechanism rather than as an inference-time cognitive model. In the original MSA setting, human-written scenarios are translated into WebPPL programs and evaluated against human judgments. We extend this procedure in two ways. First, the LLM is also used to generate the natural-language scenarios themselves, allowing the pipeline to produce many scenario–query pairs across domains. Second, we translate the synthesized probabilistic programs into Pyro, which allows us to use Python-based probabilistic programming and PyTorch infrastructure. Modern LLMs are generally stronger at writing and editing Python code (Twist et al., [2025](https://arxiv.org/html/2606.09856#bib.bib26)) than WebPPL code, which improves the reliability and expressivity of program synthesis and debugging.

Each example consists of a natural-language scenario $s$, a query $q$, latent variables $\bm{z}$, observations $\bm{x}$, and an answer variable $\bm{y}$. The scenario describes an open-world inductive setting with observed and latent properties (which are probed in the queries). Figure [2](https://arxiv.org/html/2606.09856#S3.F2) shows an example scenario and inferred posteriors on the queries. The LLM is sequentially prompted with in-context examples to:

1. Propose a scenario $s$ and queries $q_{1},\dots,q_{M}$,
2. Identify the relevant latent variables $\bm{z}$ and observation structure $p_{\theta}(\bm{z},\bm{x},\bm{y})$,
3. Write a Pyro program encoding the corresponding generative model.

We use the same framing of probabilistic quantities as MSA, where the program defines the joint distribution $p_{\theta}(\bm{z},\bm{x},\bm{y})$ from Equation [1](https://arxiv.org/html/2606.09856#S2.E1). To answer the queries in a scenario, the program is run to conduct posterior inference using MCMC or rejection sampling, giving the posterior of the queried quantity given the observations, $p_{\theta}(\bm{y}\mid\bm{x})$ from Equation [3](https://arxiv.org/html/2606.09856#S2.E3).
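As a rough illustration of the rejection-sampling option mentioned above, the sketch below uses plain Python rather than Pyro, so that no library API is assumed. The two-player model, the Gaussian priors and noise levels, and all names are hypothetical; the paper's generated programs are more elaborate.

```python
import random


def sample_prior(rng):
    """Draw latent strengths z for two hypothetical players from a Gaussian prior."""
    return {"A": rng.gauss(50, 15), "B": rng.gauss(50, 15)}


def simulate_match(z, rng):
    """Observation model p(x | z): noisy performances; True means A wins."""
    return z["A"] + rng.gauss(0, 10) > z["B"] + rng.gauss(0, 10)


def rejection_sample(observed, n_accept, rng):
    """Keep prior draws whose simulated matches reproduce `observed` exactly."""
    accepted = []
    while len(accepted) < n_accept:
        z = sample_prior(rng)
        if all(simulate_match(z, rng) == outcome for outcome in observed):
            accepted.append(z["A"])  # queried quantity: A's latent strength
    return accepted


rng = random.Random(0)
samples = rejection_sample([True, True, True], n_accept=2000, rng=rng)
posterior_mean_a = sum(samples) / len(samples)
```

The accepted draws approximate the posterior over A's strength after three observed wins, so their mean sits above the prior mean of 50. Rejection sampling becomes inefficient as the number of observations grows, which is one reason a pipeline might fall back on MCMC.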

As a sanity check, we compared the Pyro programs we generated with the publicly released MSA programs on matched scenarios and queries. We found that the posterior estimates were in strong alignment across scenarios (Figure [1](https://arxiv.org/html/2606.09856#S1.F1), bottom right).

Table [1](https://arxiv.org/html/2606.09856#S3.T1) counts scenarios and queries with posterior answers by domain. The domains we choose are Sports (which MSA also uses), Healthcare, and General. All domains use an identical pipeline, except that each uses a separate prompt in the Step 1 scenario-generation stage. They also share a common structure: a background, 3 to 4 conditions, and queries on an entity-intrinsic property, an observation-episodic variable, and two group comparisons. We provide specific scenario examples in Appendix Section [A.1](https://arxiv.org/html/2606.09856#A1.SS1).

Table 1: Dataset coverage: count of scenarios and queries.

### 3.2 Fine-Tuning Format

Posterior sampling. We use two types of supervision from the synthesized programs. The first type is based on posterior inference. The posterior $p_{\theta}(\bm{y}\mid\bm{x})$ is acquired by running the Pyro program described above. It then becomes the training label for the answer to the query in the scenario in one of two ways: (1) a discretized posterior distribution; (2) a single representative answer, which is the posterior mean. In (1), the discretized posterior is realized as probabilities on answer tokens and treated as the fine-tuning target for the LLM cross-entropy loss.

Specifically, let the posterior $p_{\theta}(\bm{y}\mid\bm{x})$ have a finite domain, and let $\{b_{0},\dots,b_{K}\}$ be the boundaries of this domain evenly partitioned into $K$ bins. We define $p_{k}$ as the posterior probability over bin $k$: $p_{k}=\int_{b_{k}}^{b_{k+1}}p_{\theta}(\bm{y}\mid\bm{x})\,d\bm{y}$. Then, the LLM cross-entropy loss is:

$$\mathcal{L}_{\text{Dist}} = -\sum_{k=0}^{K-1} p_{k}\log\hat{p}_{k},$$

where $\hat{p}_{k}$ is the LLM-predicted probability for “the token of $p_{k}$”. In practice, we scale dataset query answers to the 0 to 100 range, and $p_{k}$ represents the density around an integer, say 36. Then “the token of $p_{k}$” refers to the token for the Arabic numeral 36.

We also explore point-target fine-tuning. This second form trains on the posterior mean with probability $1$, so the LLM essentially performs regular next-token prediction for the posterior mean rounded to the nearest integer.
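The two target formats can be sketched as follows. This is a simplified illustration, not the paper's training code: it assumes posterior samples already scaled to 0-100, uses 101 integer bins formed by rounding each sample (one possible reading of "density around an integer"), and operates on plain probability lists instead of LLM logits.

```python
import math


def clamp_to_bin(value, num_bins):
    """Round to the nearest integer and clamp into [0, num_bins - 1]."""
    return min(max(int(round(value)), 0), num_bins - 1)


def discretize(samples, num_bins=101):
    """Distribution target: empirical probability p_k on each integer 0..100."""
    counts = [0] * num_bins
    for s in samples:
        counts[clamp_to_bin(s, num_bins)] += 1
    return [c / len(samples) for c in counts]


def one_hot_at_mean(samples, num_bins=101):
    """Point target: probability 1 on the rounded posterior mean."""
    k = clamp_to_bin(sum(samples) / len(samples), num_bins)
    return [1.0 if i == k else 0.0 for i in range(num_bins)]


def soft_label_cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """L_Dist = -sum_k p_k log p_hat_k, skipping bins with zero target mass."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(target_probs, predicted_probs) if p > 0)


posterior_samples = [34.2, 35.8, 36.1, 36.4, 37.0, 38.3]  # hypothetical draws
dist_target = discretize(posterior_samples)
point_target = one_hot_at_mean(posterior_samples)

uniform_prediction = [1 / 101] * 101
loss_uniform = soft_label_cross_entropy(dist_target, uniform_prediction)
loss_matched = soft_label_cross_entropy(dist_target, dist_target)
```

With a distribution target, the loss is minimized when the prediction matches the whole posterior (where it equals the target's entropy), whereas the one-hot target rewards only mass on a single integer and carries no information about spread.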

Forward sampling. As an alternative to posterior sampling, supervision can also use forward sampling. Instead of conditioning on observations and running posterior inference, we sample latent variables and outcomes directly from the generative model. Forward sampling is noisier because each sampled example reflects one draw from the model rather than an inferred posterior. However, it is cheaper: it avoids iterative posterior inference and can generate additional scenario–query–answer pairs at lower inference cost, yielding hundreds of times more datapoints. We therefore treat forward-sampled data as an alternative supervision source.
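A minimal sketch of forward sampling, under the same hypothetical two-player model used for illustration elsewhere (Gaussian strengths and noise are assumptions, not the paper's model): each call produces one joint draw of latents, observations, and answer, with no conditioning step.

```python
import random


def forward_sample(rng, n_observed=3):
    """One joint draw (z, x, y) from a toy generative model, without conditioning."""
    strength_a = rng.gauss(50, 15)
    strength_b = rng.gauss(50, 15)

    def a_wins():
        # Noisy performances decide each match.
        return strength_a + rng.gauss(0, 10) > strength_b + rng.gauss(0, 10)

    observations = [a_wins() for _ in range(n_observed)]  # x: past matches
    answer = a_wins()                                     # y: the next match
    return {
        "latents": (strength_a, strength_b),
        "observations": observations,
        "answer": answer,
    }


rng = random.Random(0)
dataset = [forward_sample(rng) for _ in range(1000)]
```

Each record pairs an observation pattern with a single sampled answer, which is why such labels are noisier than posterior targets: recovering a distribution for a given pattern would require aggregating many draws that share that pattern.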

### 3.3 Data splits and held-out motifs

To evaluate generalization beyond repeated surface forms, we split data by motifs rather than only by individual examples. A motif denotes an underlying inferential structure, implemented as patterns in the observation process. For example, one motif is “one person consistently wins but loses one match”. We replicate the motifs from MSA (Wong et al., [2025](https://arxiv.org/html/2606.09856#bib.bib1)). Training and validation examples may share broad domains, but held-out motif evaluation requires the model to generalize to new latent structures. This is stricter than a random split over paraphrases or entities, because the model cannot solve the evaluation set by memorizing a fixed scenario template.
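A motif-level split can be sketched as a group split in which whole motifs, not individual examples, are assigned to the held-out set. The record layout (a `motif` key per example), the hold-out fraction, and the function name are assumptions made for illustration.

```python
import random


def split_by_motif(examples, holdout_fraction, seed=0):
    """Assign entire motifs to the held-out set so no motif appears in both splits."""
    motifs = sorted({ex["motif"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(motifs)
    n_holdout = max(1, int(len(motifs) * holdout_fraction))
    held_out = set(motifs[:n_holdout])
    train = [ex for ex in examples if ex["motif"] not in held_out]
    test = [ex for ex in examples if ex["motif"] in held_out]
    return train, test


# Hypothetical dataset: 20 examples spread evenly over 5 motifs.
examples = [{"id": i, "motif": f"motif_{i % 5}"} for i in range(20)]
train, test = split_by_motif(examples, holdout_fraction=0.2)
```

A random split over examples would instead leave every motif represented in training, which is the leakage that the motif-level split is designed to avoid.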

## 4 Empirical Evaluations

We evaluate whether PPT improves inductive reasoning in LLMs by testing agreement with Pyro posteriors on held\-out scenarios, comparisons with strong base LLMs on these scenarios, agreement with human judgments, and transfer/calibration on external benchmarks\.

Implementation details. We use low-rank adaptation (LoRA; Hu et al., [2022](https://arxiv.org/html/2606.09856#bib.bib39)) to fine-tune Llama-3-8B-Instruct and Qwen-2-7B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2606.09856#bib.bib36); Yang et al., [2024a](https://arxiv.org/html/2606.09856#bib.bib37)). For more details, see Appendix Section [A.1](https://arxiv.org/html/2606.09856#A1.SS1). Code is available at [https://github.com/zhang-liyi/llm-inductive](https://github.com/zhang-liyi/llm-inductive).

![Refer to caption](https://arxiv.org/html/2606.09856v1/x4.png)

Figure 3: Mean absolute error (MAE), where colors indicate different models, base and fine-tuned. Axis markers indicate the evaluation data: held-out queries use sports training scenarios, held-out scenarios have unseen motifs, and the held-out domain (healthcare) is unseen by PPT (Sports) but seen by PPT (All Domain). For open-source models, we take the mean response across integers 0 to 100. For closed-source models, we use greedy sampling (Gemini-3.1-Pro) or average across 100 random seeds with temperature = 1 (Gemini-3.1-Pro sampled). As another baseline, Llama-3-Gemini-Answer-FT is Llama-3 trained on targets directly generated by Gemini rather than by using probabilistic programs. While seen scenarios are easiest for our fine-tuned models, these models also generalize to new motifs and new domains, significantly outperforming the base LLM. Diversifying training domains gives a further improvement in LLM performance (All-Domain vs. Sports).

### 4.1 Evaluating fine-tuned LLMs against MSA and human labels

#### 4.1.1 In-domain: sports scenarios

We first evaluate models on sports\-domain scenarios similar to those that were used to evaluate MSA\. These scenarios describe novel sports competitions with sparse observations about athletes or teams and query latent properties or future outcomes\. Our validation split uses held\-out scenarios with unseen motifs, so evaluation requires generalization to new inferential structures rather than memorization of templates\.

Fine-tuned models represent uncertainty structures. Figure [2](https://arxiv.org/html/2606.09856#S3.F2) shows a representative held-out scenario. We compare the answer distributions produced by the base LLM, the LLM fine-tuned by PPT, and the Pyro program. The base model produces uncalibrated responses, often placing probability mass in uneven and idiosyncratic ways (e.g., on multiples of ten; McCoy et al., [2023](https://arxiv.org/html/2606.09856#bib.bib27)), while the fine-tuned model produces a posterior closer to the distributions computed with probabilistic inference.

We further verify whether the LLM constructs a consistent model behind scenarios by checking several similar scenarios whose answers to one query are known to increase progressively. We manually edit one scenario by performing one action at a time (flipping a win or loss, or swapping two players), creating 8 scenarios labeled S1-S8, in which a player named Jamal progressively goes from weaker to stronger. We found that all methods correctly show a progressive increase in their mean estimate of Jamal's strength. However, the base LLM plateaus more than the PPT-LLM, which continues to increase by 1-2% each step for scenarios S7 and S8 and, as a result, stays closer to the program's estimates; see Figure [5](https://arxiv.org/html/2606.09856#A1.F5) in Appendix Section [A.2](https://arxiv.org/html/2606.09856#A1.SS2).

Next, we consider the overall quantitative comparison on large-scale held-out scenarios, shown in Figure [3](https://arxiv.org/html/2606.09856#S4.F3). The figure reports mean absolute error (MAE) on three evaluation sets: sports training scenarios with held-out queries, sports held-out scenarios with unseen motifs, and healthcare scenarios from a held-out domain. We find that the LLM fine-tuned with PPT (red and orange) has substantially lower mean absolute error on its posterior mean than the base LLM. These results illustrate the intended effect of our approach: the model does not merely shift its point prediction, but learns to assign probability mass in a way that better reflects the uncertainty structure of the underlying probabilistic model.

Strong closed-source LLM also fails to answer probabilistic queries well. These problems are also challenging for strong closed-source models like Gemini-3.1-Pro (Figure [3](https://arxiv.org/html/2606.09856#S4.F3)), which, surprisingly, performs worse than Llama-3-8B. Each model's mean response over the possible answer domain is more accurate than its argmax response. For Gemini, the mean response is approximated by averaging over 100 samples.
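The difference between a mean response and an argmax response can be sketched as follows, assuming access to a probability for each integer answer token from 0 to 100. The probability values and the target are made up for illustration, and the function names are not from the paper's code.

```python
def expected_answer(token_probs):
    """Mean response: probability-weighted average over integer answers 0..100."""
    total = sum(token_probs)  # renormalize in case mass on other tokens was dropped
    return sum(k * p for k, p in enumerate(token_probs)) / total


def argmax_answer(token_probs):
    """Argmax response: the single most probable integer answer."""
    return max(range(len(token_probs)), key=lambda k: token_probs[k])


def mean_absolute_error(predictions, targets):
    """MAE between predicted and target answers."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical answer distribution with mass on multiples of ten.
probs = [0.0] * 101
probs[60], probs[70], probs[80] = 0.4, 0.3, 0.3

mean_pred = expected_answer(probs)  # 0.4*60 + 0.3*70 + 0.3*80
mode_pred = argmax_answer(probs)

target = 68.0  # hypothetical program posterior mean
mae_mean = mean_absolute_error([mean_pred], [target])
mae_mode = mean_absolute_error([mode_pred], [target])
```

In this toy case the mean response lands near 69 while the argmax is 60, so the mean is closer to the target. This mirrors the observation above, although the size of the gap in practice depends on the shape of each model's answer distribution.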

Fine-tuned models match human results more closely. We compare LLM predictions with human judgments. For each scenario and query, we aggregate model responses and human responses into mean estimates with standard deviations. Figure [4](https://arxiv.org/html/2606.09856#S4.F4) shows that our approach substantially improves alignment with human judgments across held-out sports scenarios. After fine-tuning, the model's responses move closer to the diagonal across most query types, with lower error and stronger correlation. On an aggregate level, we find that the LLM fine-tuned with PPT ($0.775$) had the highest overall correlation ($R^{2}$) with humans, compared to the base LLM (0.509) and the MSA-generated program (0.769). This suggests that our approach improves agreement with both the true posterior and human judgments.

![Refer to caption](https://arxiv.org/html/2606.09856v1/x5.png) (a) Base pretrained LLM.
![Refer to caption](https://arxiv.org/html/2606.09856v1/x6.png) (b) Fine-tuned LLM.

Figure 4: Per-query mean response and standard deviation for humans and LLMs on held-out sports scenarios. Fine-tuned LLMs align much more closely with human responses. The overall correlations $R^{2}$ with humans for the base LLM, the fine-tuned LLM, and the MSA results from Wong et al. ([2025](https://arxiv.org/html/2606.09856#bib.bib1)) are 0.509, 0.775, and 0.769, suggesting that our approach achieves similar alignment to probabilistic programs.

#### 4.1.2 Out-of-domain: healthcare scenarios

We then test whether the gains from our approach transfer beyond the domain used for fine-tuning. We evaluate on healthcare-domain scenarios generated by the same pipeline, where the model must infer latent patient- or treatment-relevant quantities from sparse observations (for an example, see Appendix Section [A.1](https://arxiv.org/html/2606.09856#A1.SS1)). This provides an out-of-domain test for the sports-fine-tuned model, since the surface domain, entities, and scenario semantics differ from the sports training data.

Fine\-tuning reduces MAE\. Figure [3](https://arxiv.org/html/2606.09856#S4.F3) reports mean absolute error \(MAE\) on in\-domain and out\-of\-domain evaluation sets\. Both sports\-only and all\-domain fine\-tuning on targets generated by probabilistic programs substantially reduce MAE on all sets\.

The healthcare validation set provides a stronger test of domain transfer\. The sports\-fine\-tuned model improves over the base model despite not being trained on healthcare scenarios, suggesting that our approach teaches some domain\-general inductive behavior\.

Table 2: Transfer to external benchmarks on Llama3\-8B and Qwen2\-7B\. NLL/ECE include a temperature\-scaling \(TS\) block\. Best per column in bold \(within each block\)\. On the 3\-class BT datasets, TS lands in a high\-$T$ regime where the predictive distribution approaches uniform \(NLL $\to \ln 3 \approx 1.099$\): Base\+TS hits the optimizer’s $T = 100$ ceiling, while PPT\+TS finds an interior optimum at $T \approx 4$–$7$ that retains only marginal signal over uniform\.

\(a\) NLL across benchmarks \(lower is better\)\.

\(b\) Expected Calibration Error \(15\-bin\) \(lower is better\)\.

| Method: Llama\-3 | BT | BT\-guided | MMLU | TruthfulQA | HellaSwag | ARC\-C | Winogrande |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 0\.467 | 0\.409 | 0\.281 | 0\.462 | 0\.196 | 0\.158 | 0\.289 |
| Verbalized | 0\.255 | 0\.312 | 0\.084 | 0\.470 | 0\.141 | 0\.211 | 0\.284 |
| PPT \(Distribution\) | 0\.218 ± 0\.002 | 0\.134 ± 0\.007 | 0\.112 ± 0\.006 | 0\.275 ± 0\.009 | 0\.064 ± 0\.010 | 0\.058 ± 0\.010 | 0\.094 ± 0\.009 |
| PPT \(Mean\) | 0\.380 ± 0\.000 | 0\.246 ± 0\.003 | 0\.216 ± 0\.002 | 0\.377 ± 0\.002 | 0\.060 ± 0\.003 | 0\.118 ± 0\.005 | 0\.221 ± 0\.007 |
| Base \+ TS | – | – | 0\.100 | 0\.245 | 0\.023 | 0\.062 | 0\.126 |
| PPT \(Distribution\) \+ TS | 0\.025 ± 0\.002 | 0\.020 ± 0\.003 | 0\.069 ± 0\.004 | 0\.223 ± 0\.014 | 0\.032 ± 0\.003 | 0\.051 ± 0\.010 | 0\.080 ± 0\.002 |

| Method: Qwen\-2 | BT | BT\-guided | MMLU | TruthfulQA | HellaSwag | ARC\-C | Winogrande |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 0\.464 | 0\.445 | 0\.273 | 0\.475 | 0\.134 | 0\.135 | 0\.337 |
| PPT \(Distribution\) | 0\.278 ± 0\.005 | 0\.228 ± 0\.005 | 0\.180 ± 0\.001 | 0\.381 ± 0\.002 | 0\.060 ± 0\.001 | 0\.092 ± 0\.001 | 0\.246 ± 0\.003 |
| PPT \(Mean\) | 0\.415 ± 0\.002 | 0\.386 ± 0\.005 | 0\.254 ± 0\.001 | 0\.475 ± 0\.005 | 0\.119 ± 0\.002 | 0\.128 ± 0\.004 | 0\.324 ± 0\.001 |
| Base \+ TS | – | – | 0\.133 | 0\.475 | 0\.035 | 0\.062 | 0\.128 |
| PPT \(Distribution\) \+ TS | 0\.056 ± 0\.004 | 0\.051 ± 0\.006 | 0\.095 ± 0\.000 | 0\.383 ± 0\.003 | 0\.015 ± 0\.002 | 0\.050 ± 0\.006 | 0\.110 ± 0\.001 |

\(c\)Accuracy across benchmarks; OpenEstimate \(OE\) reported as MAE \(lower is better\)\.

### 4\.2 Fine\-tuned LLMs transfer to common external benchmarks

Finally, we evaluate whether training on data generated from probabilistic programs yields generalizable inductive and probabilistic reasoning capabilities\. To do this, we consider seven benchmarks\. We use OpenEstimate \(OE; Marzoev et al\. \([2026](https://arxiv.org/html/2606.09856#bib.bib15)\)\) because it measures an LLM’s prior elicitation\. We evaluate on the Bayesian Teaching dataset \(BT; Qiu et al\. \([2026](https://arxiv.org/html/2606.09856#bib.bib14)\)\) because it features multiple\-choice problems that involve uncertainty and have an underlying optimal Bayesian model\. For a wider performance and calibration evaluation, we use MMLU \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.09856#bib.bib30)\), TruthfulQA \(Lin et al\., [2022](https://arxiv.org/html/2606.09856#bib.bib31)\), HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2606.09856#bib.bib34)\), ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2606.09856#bib.bib35)\), and Winogrande \(Sakaguchi et al\., [2021](https://arxiv.org/html/2606.09856#bib.bib29)\), which involve challenging multiple\-choice problems that are not purely deterministic\.

Metrics\. We use accuracy or mean absolute error \(MAE\) as a simple and intuitive measure of performance, negative log\-likelihood \(NLL\) to measure how much probability the LLM assigns to the ground\-truth answer, and expected calibration error \(ECE\) to measure model calibration\.
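
For concreteness, the three metrics can be written down directly\. This is a minimal sketch under stated assumptions: `probs` holds one predicted probability vector per multiple\-choice question, `labels` holds the index of the correct option, and ECE uses equal\-width confidence bins \(15 by default, matching the 15\-bin ECE named in Table 2\)\. The helper names are hypothetical, and the paper's exact binning convention is not given in this section\.

```python
import math

def nll(probs, labels):
    """Mean negative log-likelihood assigned to the correct option."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def ece(probs, labels, n_bins=15):
    """Expected calibration error with equal-width confidence bins.

    Bins are half-open [k/n, (k+1)/n); a confidence of exactly 1.0 is
    placed in the last bin. This convention is an assumption.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)                      # confidence of the argmax option
        pred = p.index(conf)               # predicted option
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(labels)
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            avg_acc = sum(a for _, a in b) / len(b)
            err += len(b) / total * abs(avg_acc - avg_conf)
    return err

def mae(preds, targets):
    """Mean absolute error between numeric estimates and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)
```

A model that is 90% confident but right only half the time has an ECE of 0\.4 on those items, which is the kind of overconfidence ECE is meant to expose even when accuracy is unchanged\.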

Baselines\. We use the base instruct LLM, a verbalized LLM \(Tian et al\., [2023](https://arxiv.org/html/2606.09856#bib.bib25)\), and temperature scaling \(TS; Guo et al\., [2017](https://arxiv.org/html/2606.09856#bib.bib24)\)\. The verbalized LLM is the base LLM that explicitly states a confidence level for the multiple choices in its response\. This baseline tests whether the base LLM already exhibits reliable and extractable confidence levels beyond token\-level probabilities\. TS is a post\-hoc calibration method that tunes the temperature parameter for each dataset\.
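
Temperature scaling divides the logits by a single scalar $T$ and picks the $T$ that minimizes NLL on a calibration set\. The sketch below is illustrative only: it swaps in a log\-spaced grid search for whatever optimizer the authors used, and the bounds `t_min` and `t_max` are assumptions, with the upper bound of 100 chosen to mirror the ceiling mentioned in the caption of Table 2\.

```python
import math

def mean_nll(logit_rows, labels, T):
    """Mean NLL of the correct option under softmax(logits / T)."""
    total = 0.0
    for z, y in zip(logit_rows, labels):
        m = max(z)  # subtract the max for numerical stability
        log_norm = math.log(sum(math.exp((v - m) / T) for v in z))
        total -= (z[y] - m) / T - log_norm
    return total / len(labels)

def fit_temperature(logit_rows, labels, t_min=0.05, t_max=100.0, n_grid=200):
    """Grid search over log-spaced temperatures in [t_min, t_max]."""
    lo, hi = math.log(t_min), math.log(t_max)
    grid = [math.exp(lo + (hi - lo) * i / (n_grid - 1)) for i in range(n_grid)]
    return min(grid, key=lambda T: mean_nll(logit_rows, labels, T))

# Toy 3-class example (hypothetical logits): always confident in option 0.
rows = [[4.0, 0.0, 0.0]] * 3
T_over = fit_temperature(rows, [0, 0, 1])   # right 2 times out of 3
T_bad = fit_temperature(rows, [1, 2, 1])    # never right
```

When the logits carry no useful signal, as in the second toy case, the best temperature runs to the upper bound and the scaled prediction approaches uniform, so the NLL approaches $\ln 3$ for three classes\. This is the high\-$T$ regime described in the caption of Table 2\.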

Our approach\. PPT \(Distribution\), which fine\-tunes on full posterior probabilistic labels, is our primary method\. We also consider a variant, PPT \(Mean\), that fine\-tunes on the same dataset but targets only the mean of the distribution\. This ablates the distributional target and tests its utility\. In addition, we implement forward sampling to generate distributional targets in place of full posterior inference, since its inherent scalability allows the creation of datasets that are orders of magnitude larger\.
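
The difference between the two targets can be made concrete with a small sketch\. It assumes that answers are integers on the 0\-100 scale, that a distributional target is a normalized histogram of posterior samples, and that training minimizes cross\-entropy against that target\. These are illustrative assumptions; the exact target construction and loss are not specified in this section, and all function names are hypothetical\.

```python
import math
from collections import Counter

def soft_label(samples, lo=0, hi=100):
    """Normalized histogram of rounded posterior samples over lo..hi."""
    counts = Counter(min(hi, max(lo, round(s))) for s in samples)
    n = len(samples)
    return [counts.get(v, 0) / n for v in range(lo, hi + 1)]

def mean_label(samples, lo=0, hi=100):
    """One-hot target at the rounded posterior mean (mean-only ablation)."""
    m = min(hi, max(lo, round(sum(samples) / len(samples))))
    return [1.0 if v == m else 0.0 for v in range(lo, hi + 1)]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a (soft) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted) if t > 0)

# Toy posterior samples (hypothetical) with mass at 10, 20, and 32.
samples = [10, 10, 20, 32]
soft = soft_label(samples)     # spreads mass over 10, 20, 32
hard = mean_label(samples)     # puts all mass on the mean, 18
uniform = [1 / 101] * 101
```

In this toy case the mean\-only target places all of its mass on 18, a value the posterior samples never take, which illustrates the information a mean target discards when the posterior is spread out or multimodal\.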

PPT transfers to external benchmarks\. Table [2\(c\)](https://arxiv.org/html/2606.09856#S4.T2.st3) summarizes the results\. PPT \(Distribution\) substantially reduces NLL relative to the base model across all reported benchmarks for both Llama and Qwen\. The calibration results show a stronger and more consistent pattern on MMLU, HellaSwag, ARC\-Challenge, and Winogrande\. ECE improves substantially: PPT reduces ECE relative to the base model on every listed multiple\-choice benchmark\. This suggests that probabilistic\-program\-mediated fine\-tuning improves the model’s raw uncertainty estimates\.

Temperature scaling \(TS\) remains a strong post\-hoc calibration baseline\. In several cases, Base\+TS closes much of the NLL and ECE gap between the base and fine\-tuned models\. However, applying temperature scaling on top of PPT \(Distribution\) yields further improvements, with lower NLL and ECE on most benchmarks\. This indicates that PPT and post\-hoc calibration are complementary: temperature scaling adjusts confidence on a target distribution, while PPT changes the underlying model to produce better raw uncertainty estimates\.

Finally, PPT also improves accuracy on these benchmarks\. On OpenEstimate, PPT \(Distribution\) substantially improves prior\-elicitation accuracy, reducing MAE from 29\.0 to 23\.3 for Llama\. This is consistent with the goal of improving quantitative inductive estimation\. Although the gains are modest on the other benchmarks, our results suggest that improved prior\-knowledge elicitation and calibration can not only improve the answer distribution but also correctly flip the argmax answer\.

Fine\-tuning without a distributional target reduces the improvement\. For both Llama and Qwen, PPT \(Distribution\) and PPT \(Mean\) use identical training data except for the target, but PPT \(Distribution\) outperforms on every metric\. PPT \(Distribution\) also attains higher accuracy and stronger post\-TS calibration than models fine\-tuned with forward\-sampling data \(Table [3\(c\)](https://arxiv.org/html/2606.09856#A1.T3.st3) in the Appendix\)\. This suggests that distributional information plays an important role in the improvements observed in transfer to external benchmarks\.

## 5 Discussion

LLMs have fared well on deductive reasoning tasks, such as mathematics and coding, but their performance on inductive reasoning, which requires inferring uncertain latent properties from sparse observations, has remained insufficient\. While post\-training offers a means to improve inductive inference, collecting high\-quality data at scale and handling inherently distributional targets pose significant challenges\. To address these shortcomings, we propose program\-based posterior training \(PPT\), a pipeline that \(1\) programmatically synthesizes diverse open\-world scenarios; \(2\) translates these natural\-language scenarios into probabilistic programs; \(3\) runs posterior inference on the generated programs to produce distributional targets; and \(4\) fine\-tunes an LLM on these data to approximate posterior beliefs\. Overall, our results demonstrate that PPT allows models to internalize uncertainty, generalizing to entirely unseen motifs and domains, while remaining complementary to traditional temperature\-scaling techniques\.

More broadly, our results suggest a different way to think about reasoning post\-training\. Standard reasoning supervision is often framed around problems with single verifiable answers, but inductive reasoning requires models to maintain uncertainty over latent explanations and outcomes\. PPT uses probabilistic programs to construct such supervision\. Empirically, we found that our approach improves both task\-specific inductive inference and broader uncertainty behavior\. On held\-out synthetic scenarios, PPT substantially reduces MAE relative to base LLMs and strong closed\-source baselines, including settings with unseen motifs and a held\-out healthcare domain\. On human\-labeled sports scenarios, PPT improves alignment with human judgments, reaching a correlation comparable to that of the original MSA programs\. On external benchmarks, PPT improves OpenEstimate prior\-elicitation MAE and consistently reduces NLL and ECE across multiple\-choice tasks, with gains that remain complementary to temperature scaling\. Together, these results suggest that probabilistic programs can serve not only as test\-time reasoning tools, but also as scalable supervision sources for training LLMs to form calibrated beliefs\.

Limitations and future directions\. Our approach relies on LLM\-synthesized probabilistic programs, so the quality of the supervision depends on whether the generated programs faithfully capture the intended scenario\. Although we selectively evaluated some of the generated programs against similar open\-source programs and human results, future work can add stronger automatic validation and human audits of program quality\.

Our experiments focus on numeric and multiple\-choice queries, which allow precise measurement of posterior distributions, MAE, NLL, and ECE\. This does not fully cover open\-ended inductive reasoning\. Extending PPT to more naturalistic settings with intermediate reasoning steps and/or interactions is an important direction\.

Conclusion\. Our results show that using LLMs to generate probabilistic programs, which are in turn used to train LLMs on natural\-language inductive inference problems, is an effective way of improving the capabilities of these models\. Fine\-tuning on data generated from probabilistic programs improves held\-out probabilistic estimation, alignment with human judgments, transfer to external estimation and multiple\-choice benchmarks, and raw calibration\. These results suggest that probabilistic programs can serve not only as test\-time reasoning tools, but also as scalable supervision sources for training LLMs to form calibrated beliefs\.

## Acknowledgments

LZ and TLG are supported by grant N00014\-23\-1\-2510 from the Office of Naval Research\. AKJ is supported by a Natural and Artificial Mind \(NAM\) Fellowship from the Scully Peretsman foundation\. BML is supported by the U\.S\. National Science Foundation \(NSF\) under Cooperative Agreement No\. 2433429, NSF AI Research Institute on Interaction for AI Assistants \(ARIA\)\. We thank Katie Collins for helpful discussion\.

## References

- \[1\] E\. Bingham, J\. P\. Chen, M\. Jankowiak, F\. Obermeyer, N\. Pradhan, T\. Karaletsos, R\. Singh, P\. Szerlip, P\. Horsfall, and N\. D\. Goodman \(2019\) Pyro: Deep Universal Probabilistic Programming\. Journal of Machine Learning Research 20, pp\. 1–6\.
- \[2\] \(2006\) Pattern recognition and machine learning \(information science and statistics\)\. Springer\. ISBN 0387310738\.
- \[3\] D\. M\. Blei \(2014\) Build, compute, critique, repeat: data analysis with latent variable models\. In Annual Review of Statistics and Its Application\.
- \[4\] M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\.
- \[5\] P\. Chhikara \(2025\) Mind the confidence gap: overconfidence, calibration, and distractor effects in large language models\. Transactions on Machine Learning Research \(TMLR\)\. [Link](https://openreview.net/forum?id=lyaHnHDdZl)\.
- \[6\] P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? Try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\.
- \[7\] N\. D\. Goodman and A\. Stuhlmüller \(2014\) The design and implementation of probabilistic programming languages\. [http://dippl\.org](http://dippl.org/)\.
- \[8\] N\. D\. Goodman, J\. B\. Tenenbaum, and T\. Gerstenberg \(2014\) Concepts in a probabilistic language of thought\. In Proceedings of the 36th Annual Conference of the Cognitive Science Society\.
- \[9\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. Canton Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. Silveira Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. Satish Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Y\. Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\) The llama 3 herd of models\. arXiv:2407\.21783\.
- \[10\] T\. L\. Griffiths, C\. Kemp, and J\. B\. Tenenbaum \(2008\) Bayesian models of cognition\. In Cambridge Handbook of Computational Cognitive Sciences\.
- \[11\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning, pp\. 1321–1330\.
- \[12\] D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. In International Conference on Learning Representations\.
- \[13\] N\. Hollmann, S\. Müller, L\. Purucker, A\. Krishnakumar, M\. Körfer, S\. B\. Hoo, R\. T\. Schirrmeister, and F\. Hutter \(2025\) Accurate predictions on small data with a tabular foundation model\. Nature 637\(8045\), pp\. 319–326\. ISBN 1476\-4687\.
- \[14\] E\. J\. Hu, M\. Jain, E\. Elmoznino, Y\. Kaddar, G\. Lajoie, Y\. Bengio, and N\. Malkin \(2024\) Amortizing intractable inference in large language models\. In International Conference on Learning Representations\.
- \[15\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\.
- \[16\] A\. K\. Jagadish, M\. Thalmann, J\. Coda\-Forno, M\. Binz, and E\. Schulz \(2025\) Meta\-learning ecological priors from large language models explains human learning and decision making\. arXiv preprint arXiv:2509\.00116\.
- \[17\] A\. K\. Jagadish, J\. Coda\-Forno, M\. Thalmann, E\. Schulz, and M\. Binz\. Human\-like category learning by injecting ecological priors from large language models into neural networks\. In Forty\-first International Conference on Machine Learning\.
- \[18\] B\. M\. Lake, R\. Salakhutdinov, and J\. B\. Tenenbaum \(2015\) Human\-level concept learning through probabilistic program induction\. Science 350\(6266\), pp\. 1332–1338\.
- \[19\] S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Dublin, Ireland, pp\. 3214–3252\.
- \[20\] I\. Loshchilov and F\. Hutter \(2019\) Decoupled weight decay regularization\. In International Conference on Learning Representations\.
- \[21\] A\. Marzoev, J\. Ross, M\. Cafarella, and J\. Andreas \(2026\) OpenEstimate: evaluating LLMs on reasoning under uncertainty with real\-world data\. In The Fourteenth International Conference on Learning Representations\.
- \[22\] R\. T\. McCoy, S\. Yao, D\. Friedman, M\. Hardy, and T\. L\. Griffiths \(2023\) Embers of autoregression: understanding large language models through the problem they are trained to solve\. arXiv preprint arXiv:2309\.13638\.
- \[23\] L\. Qiu, F\. Sha, K\. Allen, Y\. Kim, T\. Linzen, and S\. van Steenkiste \(2026\) Bayesian teaching enables probabilistic reasoning in large language models\. Nature Communications 17\(1\), pp\. 1238\. ISBN 2041\-1723\.
- \[24\] M\. Rmus, A\. K\. Jagadish, M\. Mathony, T\. Ludwig, and E\. Schulz \(2025\) Generating computational cognitive models using large language models\. Advances in Neural Information Processing Systems 38, pp\. 87796–87833\.
- \[25\] K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\) WinoGrande: an adversarial winograd schema challenge at scale\. Communications of the ACM 64\(9\), pp\. 99–106\.
- \[26\] M\. Shen, S\. Das, K\. Greenewald, P\. Sattigeri, G\. W\. Wornell, and S\. Ghosh \(2024\) Thermometer: towards universal calibration for large language models\. In Proceedings of the 41st International Conference on Machine Learning, R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Vol\. 235, pp\. 44687–44711\.
- \[27\] J\. B\. Tenenbaum, C\. Kemp, T\. L\. Griffiths, and N\. D\. Goodman \(2011\) How to grow a mind: statistics, structure, and abstraction\. Science 331\(6022\), pp\. 1279–1285\.
- \[28\] K\. Tian, E\. Mitchell, A\. Zhou, A\. Sharma, R\. Rafailov, H\. Yao, C\. Finn, and C\. Manning \(2023\) Just ask for calibration: strategies for eliciting calibrated confidence scores from language models fine\-tuned with human feedback\. In Empirical Methods in Natural Language Processing \(EMNLP\)\.
- \[29\] L\. Twist, J\. M\. Zhang, M\. Harman, D\. Syme, J\. Noppen, and D\. Nauck \(2025\) LLMs love Python: a study of LLMs’ bias for programming languages and libraries\. arXiv preprint arXiv:2503\.17181\.
- \[30\] R\. Wang, E\. Zelikman, G\. Poesia, Y\. Pu, N\. Haber, and N\. Goodman \(2024\) Hypothesis search: inductive reasoning with language models\. In The Twelfth International Conference on Learning Representations\.
- \[31\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Advances in Neural Information Processing Systems, Red Hook, NY, USA\. ISBN 9781713871088\.
- \[32\] L\. Wong, K\. M\. Collins, L\. Ying, C\. E\. Zhang, A\. Weller, T\. Gerstenberg, T\. O’Donnell, A\. K\. Lew, J\. D\. Andreas, J\. B\. Tenenbaum, and T\. Brooke\-Wilson \(2025\) Modeling open\-world cognition as on\-demand synthesis of probabilistic models\. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol\. 47\.
- \[33\] L\. Wong, G\. Grand, A\. K\. Lew, N\. D\. Goodman, V\. K\. Mansinghka, J\. Andreas, and J\. B\. Tenenbaum \(2023\) From word models to world models: translating from natural language to the probabilistic language of thought\. arXiv:2306\.12672\.
- \[34\] M\. Wu and N\. Goodman \(2022\) Foundation posteriors for approximate probabilistic inference\. In Advances in Neural Information Processing Systems\.
- \[35\] S\. M\. Xie, A\. Raghunathan, P\. Liang, and T\. Ma \(2022\) An explanation of in\-context learning as implicit Bayesian inference\. In International Conference on Learning Representations\.
- \[36\] A\. Yang, B\. Yang, B\. Hui, B\. Zheng, B\. Yu, C\. Zhou, C\. Li, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Tang, J\. Wang, J\. Yang, J\. Tu, J\. Zhang, J\. Ma, J\. Yang, J\. Xu, J\. Zhou, J\. Bai, J\. He, J\. Lin, K\. Dang, K\. Lu, K\. Chen, K\. Yang, M\. Li, M\. Xue, N\. Ni, P\. Zhang, P\. Wang, R\. Peng, R\. Men, R\. Gao, R\. Lin, S\. Wang, S\. Bai, S\. Tan, T\. Zhu, T\. Li, T\. Liu, W\. Ge, X\. Deng, X\. Zhou, X\. Ren, X\. Zhang, X\. Wei, X\. Ren, X\. Liu, Y\. Fan, Y\. Yao, Y\. Zhang, Y\. Wan, Y\. Chu, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Guo, and Z\. Fan \(2024\) Qwen2 technical report\. arXiv:2407\.10671\.
- \[37\] Z\. Yang, L\. Dong, X\. Du, H\. Cheng, E\. Cambria, X\. Liu, J\. Gao, and F\. Wei \(2024\) Language models as inductive reasoners\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), Y\. Graham and M\. Purver \(Eds\.\), pp\. 209–225\.
- \[38\] R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: can a machine really finish your sentence?\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\.
- \[39\] L\. Zhang, M\. Y\. Li, R\. T\. McCoy, T\. Sumers, J\. Zhu, and T\. L\. Griffiths \(2025\) What should embeddings embed? Autoregressive models represent latent generating distributions\. Transactions on Machine Learning Research\. Featured Certification\. ISSN 2835\-8856\.

## Appendix A Technical appendices and supplementary material

### A\.1 Implementation details

Hyperparameters\. Here we detail the experimental setup\. We use PyTorch with Torchtune to fine\-tune Llama\-3\-8B\-Instruct and Qwen\-2\-7B\-Instruct\. Each experiment uses a single A100 GPU with 40GB of memory\. All methods use the AdamW optimizer \[[20](https://arxiv.org/html/2606.09856#bib.bib41)\], a batch size of 2, inner loops with 5 gradient steps, and LoRA adapters with rank $= 8$, following standard practice; the learning rate is tuned in $[10^{-6}, 10^{-4}]$\.

An ‘epoch’ here is defined as training on 200 batches from the train set, and all models train up to 500 epochs, using an early\-stopping of 50 epochs\.
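
The training schedule above can be sketched as a small stopping rule\. This is a minimal sketch, assuming that "early\-stopping of 50 epochs" means a patience of 50 epochs without improvement and that the monitored quantity is a validation loss; both are interpretations rather than details stated here, and the function and constant names are hypothetical\.

```python
BATCHES_PER_EPOCH = 200   # from the text: one 'epoch' is 200 batches
BATCH_SIZE = 2            # from the text
EXAMPLES_PER_EPOCH = BATCHES_PER_EPOCH * BATCH_SIZE

def stopping_epoch(val_losses, max_epochs=500, patience=50):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` epochs have passed without a new best
    validation loss, or after `max_epochs`, whichever comes first.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return min(len(val_losses), max_epochs)

# Toy curve (hypothetical): improves for 10 epochs, then plateaus.
plateau = [1.0 - 0.01 * i for i in range(10)] + [0.95] * 100
# Toy curve (hypothetical): keeps improving past the epoch cap.
improving = [1.0 / (i + 1) for i in range(600)]
```

Under these assumptions one 'epoch' covers 400 training examples, so the 500\-epoch cap corresponds to at most 200,000 example presentations per run\.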

Scenario examples\. Here we show a scenario and query example from each domain: sports, healthcare, and general\. While most scenarios follow a fixed writing format, we also show one free\-form scenario\.

Here is an instruction block that the LLM sees when responding to each scenario\.

LLM Instruction

Answer the query in the scenario and return only an integer wrapped in < and \>\. For example, <x\>\. Use 0\-100 scale\. For a query on individual rank, a higher number means a higher ranking \(e\.g\. 100 means the individual ranks highest in that criterion; 1 is lowest\)\. For a query on which of the two teams wins, a smaller number means the first team more likely wins\.
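
Because the instruction fixes the output format, a response can be checked mechanically\. The parser below is a hypothetical helper, not the authors' evaluation code: it assumes the answer is the last integer wrapped in angle brackets and treats anything outside 0\-100 as unparseable\.

```python
import re

# An optional sign and digits wrapped in < and >, allowing inner whitespace.
ANSWER_RE = re.compile(r"<\s*(-?\d+)\s*>")

def parse_answer(response, lo=0, hi=100):
    """Return the last <integer> in the response, or None if invalid."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    value = int(matches[-1])
    return value if lo <= value <= hi else None
```

Taking the last match is a design choice that tolerates a model restating or revising its answer; a stricter evaluator could instead reject any response containing more than the single wrapped integer\.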

A sports scenario example is given in the main text in Figure[2](https://arxiv.org/html/2606.09856#S3.F2)\.

Here are the other scenarios\. We also add the program\-inferred posterior mean as answers for illustration\.

Healthcare Scenario

BACKGROUND

In this endocrine day clinic, patients are monitored during a series of metabolic stress tests\. In each episode, the group that responds better depends on the average insulin sensitivity / glucose regulation efficiency and autonomic stability \(orthostatic tolerance\) of the patients, modulated by missed meals / fasting that episode, acute stress level that episode, and caffeine/alcohol intake that episode\. Patients are under the care of either Dr\. Smith or Dr\. Jones, and all clinical events take place on the same day\.

CONDITIONS

Dr\. Smith cares for Taylor, Ava, Beth, Cara, Dana, Elsa, Faye, Gina, Hope, Iris, Judy, Kara, Lana, and Mary, whereas Dr\. Jones cares for Nina, Opal, Page, Quin, Rosa, Sara, Tara, Uma, Vera, Willa, Xena, Yara, and Zara\.

In the first episode, Taylor, Ava, Beth, Cara, Dana, Elsa, and Faye responded better than Gina, Hope, Iris, Judy, Kara, Lana, and Mary\.

In the second episode, Taylor, Nina, Opal, Page, Quin, Rosa, and Sara responded better than Tara, Uma, Vera, Willa, Xena, Yara, and Zara\.

In the third episode, Taylor, Gina, Hope, Iris, Judy, Kara, and Lana responded better than Nina, Opal, Page, Quin, Rosa, Sara, and Tara\.

QUERIES

Query 1: On a percentage scale from 0 to 100%, how strong was missed meals / fasting that episode in the third clinical episode for Gina?

Answer 1: 73

Query 2: In a new clinical episode later this same day involving Gina, Hope, Iris, Judy, Kara, Lana, and Mary versus Nina, Opal, Page, Quin, Rosa, Sara, and Tara, who would have the better outcome and by how much?

Answer 2: 6

Query 3: In a new clinical episode later this same day involving Taylor, Ava, Beth, Cara, Dana, Elsa, and Faye versus Tara, Uma, Vera, Willa, Xena, Yara, and Zara, who would have the better outcome and by how much?

Answer 3: 5

Query 4: Out of 100 random patients, where do you think Taylor ranks in terms of intrinsic insulin sensitivity / glucose regulation efficiency?

Answer 4: 68

General Domain Scenario

BACKGROUND

In this event, watchmakers are competing in a series of timed challenges to assemble intricate mechanical watches\. In each round, the team that achieves the better outcome depends on the average assembly precision of the watchmakers, based on their intrinsic mechanical dexterity modulated by four other factors: lighting quality, tool sharpness, eye fatigue, and ambient humidity\. Watchmakers compete in teams of five, and all challenges take place on the same day\.

CONDITIONS

In the first challenge, Julian, Clara, Thomas, Elise, and Victor completed the watch assembly more successfully than Marcus, Sophie, Daniel, Chloe, and Felix\.

In the second challenge, Liam, Nora, Oliver, Maya, and Lucas completed the watch assembly more successfully than Henry, Emma, Wyatt, Grace, and Felix\.

In the third challenge, Julian, Clara, Thomas, Elise, and Victor completed the watch assembly more successfully than Liam, Nora, Oliver, Maya, and Felix\.

QUERIES

Query 1: On a percentage scale from 0 to 100%, how strong was eye fatigue in the third event for Felix?

Answer 1: 66

Query 2: In a new situation later this same day involving Liam, Nora, Oliver, Maya, and Lucas \(Team 1\) versus Marcus, Sophie, Daniel, Chloe, and Felix \(Team 2\), who would have the better outcome and by how much?

Answer 2: 30

Query 3: In a new situation later this same day involving Julian, Clara, Thomas, Elise, and Victor \(Team 1\) versus Henry, Emma, Wyatt, Grace, and Lucas \(Team 2\), who would have the better outcome and by how much?

Answer 3: 22

Query 4: Out of 100 random watchmakers, where do you think Felix ranks in terms of intrinsic mechanical dexterity?

Answer 4: 34

Free\-Form Scenario

Seaside Pickleball Invitational: Coaching Staff Chat Log

Head Coach Martinez: Let’s review today’s tournament results\. How did our fixed doubles pairs look out there?

Asst Coach Davis: It was an interesting day, definitely heavily influenced by the coastal weather\. The morning started dead calm and cool\. Tariq and Leo played Jin and Carlos first\. Tariq and Leo won comfortably\. Tariq’s baseline drives were incredibly sharp right out of the gate\.

Head Coach Martinez: Good start\. How about the mid\-day matches?

Asst Coach Davis: The sun got absolutely brutal around noon\. Tariq and Leo played Emma and Noah\. Tariq and Leo won narrowly\. To be honest, Leo looked like he was really dragging his feet in the heat and missing easy kitchen volleys that he normally puts away\.

Head Coach Martinez: I actually caught the singles exhibition right after lunch\. Tariq played Jin 1\-on\-1\. Tariq won by a lot\. He was covering the whole court and barely looked like he was breaking a sweat\.

Asst Coach Davis: Yeah, individually, Tariq is a machine\. But the late afternoon doubles match was a different story\. Tariq and Leo went up against Sarah and Maya\. By then, the coastal wind was howling, which was completely messing with the lightweight ball on lobs and drops\. Plus, it was Tariq and Leo’s third doubles match of the day in the sun\. Sarah and Maya ended up winning narrowly\.

Head Coach Martinez: Tough break, but it makes sense\. Sarah and Maya play a very grounded, tactical game that probably holds up well when the wind picks up\. Did any other matches happen before sunset?

Asst Coach Davis: Just one evening match, right after the wind completely died down and the temperature dropped back to normal\. Jin and Carlos played Emma and Noah\. Jin and Carlos won comfortably, looking very coordinated\.

\*\*\*

Post\-Tournament Analytical Assessment

QUERIES
Query 1: What is the probability \(from 0 to 100\) that Tariq and Leo would comfortably defeat Sarah and Maya if they played a rematch first thing the next morning in calm, cool conditions?
Answer 1: 0
Query 2: What is the probability \(from 0 to 100\) that Jin and Carlos would narrowly defeat Sarah and Maya if they played a match against each other under standard, weather\-neutral conditions?
Answer 2: 0
Query 3: Out of 100 random competitive pickleball players, where would you rank Tariq’s intrinsic individual skill?
Answer 3: 91
Query 4: On a scale from 0 to 100 \(where 0 is completely fresh and 100 is completely exhausted\), what was Leo’s likely fatigue level during the late afternoon match against Sarah and Maya?
Answer 4: 23

Prompts to LLM in data generation

This section gives further implementation details for the pipeline in Figure[1](https://arxiv.org/html/2606.09856#S1.F1)\. The Step 1 \(scenario generation\) prompts are shown below\. Our Step 2 and Step 3 prompts follow those used by \[[32](https://arxiv.org/html/2606.09856#bib.bib1)\]\.

Example Sports Scenario Used In\-Context

<START\_SCENARIO\>

BACKGROUND
In this event, the athletes are competing in a series of synchronized diving tournaments\. Each tournament consists of a series of rounds\. In each match, athletes compete as part of a team\. In a given round, each team receives a combined score based on the difficulty of the dive and the execution of the dive\. Athletes compete either individually or as a team\. All matches take place on the same day\.

CONDITIONS
In the first round, Jamie and Gale beat Sam and Jordan\.
In the second round, Jamie and Gale beat Blake and Avery\.
In the third round, Jamie and Avery beat Blake and Sam\.
In the fourth round, Cameron and Gale beat Dakota and Joe\.
In the fifth round, Jamie and Cameron lost to Blake and Avery\.

QUERIES
Query 1: Out of 100 random athletes, where do you think Jamie ranks in terms of intrinsic skill?
Query 2: On a percentage scale from 0 to 100% \(where 0=extremely easy, 100=as difficult as possible\), how difficult of a dive do you think the team Jamie was on in the third round attempted?
Query 3: In a new round later this same day between Jamie and Gale \(Team 1\) and Sam and Alice \(Team 2\), who would win and by how much?
Query 4: In a new round later this same day between Jamie and Cameron \(Team 1\) and Avery and Drew \(Team 2\), who would win and by how much?

<END\_SCENARIO\>

Sports Scenario Generation Prompt

You will be asked to generate a scenario based on the given example scenario\. A scenario consists of a background, 3 or 4 conditions, and 4 queries\. Generate scenarios in the domain of sports\. Keep the type of queries the same\. In the conditions, keep them relatively concise similar to the example, and do not exactly specify the win and loss points\. Also, prioritize the form “A beat B” compared to “A beat B 1 on 1” or “A beat B in a singles match”\. Notice that the background discusses that performance depends on individual\-inherent variable\(s\), <MAIN\_VARIABLE\>, and episodic variable\(s\), <EPISODIC\_VARIABLE\>\. You can improvise the phrasing, instead of saying “performance depends on x and y”\.

Create scenarios where, for one of the players denoted ’X’, <P\>, <C\>, and <R\>\. Choose <N\> for team\-size\. <1v1 OPTION\> Use a random name for player ’X’ \(don’t actually use ’X’\)\. Query 1 should ask about the individual\-inherent factor\. Query 2 should ask about the episodic factor\.

Use the following sports subdomain for the scenario: <SUBDOMAINS\>\. Here is the example scenario, and be sure to use the same <START\_SCENARIO\> <END\_SCENARIO\> to wrap around the scenario:

<Example sports scenario here\>

Free\-Form Scenario Generation Prompt

You are generating a synthetic “open\-world” sports scenario for training/evaluating probabilistic reasoning in LLMs\.

Core idea
\- A scenario describes a small sports world where outcomes depend on \(i\) each athlete’s intrinsic strength and \(ii\) multiple latent, match\-dependent factors \(e\.g\., effort, fatigue, teamwork chemistry, strategy matchups, weather, injuries, home advantage, equipment\)\.
\- Observations are incomplete and noisy: the reader must infer hidden traits and hidden match factors from a few results and contextual clues\.

Freedom requirement \(important\)
\- Do NOT use a rigid template like “BACKGROUND / CONDITIONS / QUERIES”\.
\- You may vary format: a short story, a coach’s notes, a sports journalist recap, a chat log, a referee report, a table of results, etc\.
\- Vary the number of observed events \(e\.g\., 3–8 results\) and the number of queries \(e\.g\., 3–6\)\.
\- Keep the scenario self\-contained\.

Output formatting
\- Produce exactly 1 scenario\.
\- Wrap the scenario in <START\_SCENARIO\> … <END\_SCENARIO\>\.
\- Inside the scenario, include:
\(a\) a narrative \+ observable match results,
\(b\) enough detail to imply latent factors without explicitly revealing them,
\(c\) a set of queries that require inference and uncertainty\.

Sports world design constraints \(must\-have\)
\- Create a scenario where, for one of the players denoted ’X’, <P\>, <C\>, and <R\>\. Use <N\> for team\-size\. <1v1 OPTION\> Use a random name for player ’X’ \(don’t actually use ’X’\)\.
\- Choose from the following sports subdomains for this scenario: <SUBDOMAINS\>\.
\- Do not specify exact point totals; if needed, use qualitative margins \(“narrowly”, “comfortably”, “by a lot”, “slightly”, “in overtime”, “came from behind”\)\.
\- You may include time ordering \(“earlier that day”, “two days later”\) and let fatigue/learning matter\.

Latent factors \(must\-have\)
\- The scenario must have at least TWO latent match\-dependent factors beyond intrinsic strength\.
\- At least one factor should plausibly vary across matches \(e\.g\. effort, fatigue, tilt, coordination, weather\)\.

Queries \(must\-have\)
Make queries that require:
1\) Estimating an athlete’s intrinsic strength as a distribution or percentile rank \(e\.g\., “out of 100 athletes, where do they rank?”\)\.
2\) Inferring a latent match factor on a 0–100 scale for a specific match \(effort / teamwork / fatigue / etc\.\)\.
3\) Predicting outcomes of 1–2 hypothetical future matches, including \*who wins\* and \*by how much\* qualitatively \(or as a probability\), while acknowledging uncertainty\.
Important: each query should be answerable by one number\. Wrong examples: 1\) explicitly asking for verbal justification; 2\) asking about two players’ intrinsic strengths within one query\. However, do not explicitly say ’answer the queries in one number’\. It should be implicit in the queries like in the example scenario below\.

Uncertainty instruction
\- Ensure that multiple explanations remain plausible\. Do not leak the true strengths or factor values\.
\- Avoid making any single athlete dominate all outcomes; include at least one “surprising” result that is explainable via latent factors\.

An example scenario is given below\. Remember to vary the form and story, but keep the query format the same\.

<Example sports scenario here\>

Now generate the scenario following all rules above\.
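Both prompts ask the model to wrap its output in `<START_SCENARIO>` and `<END_SCENARIO>` tags. The sketch below illustrates one way such output could be parsed downstream. It is not taken from the paper: the function names `extract_scenarios` and `split_queries` are hypothetical, and the query splitter assumes the templated `QUERIES` / `Query k:` layout of the example scenario, which free-form scenarios are not required to follow.

```python
import re

# Matches everything between the wrapper tags requested in the prompts.
SCENARIO_RE = re.compile(r"<START_SCENARIO>(.*?)<END_SCENARIO>", re.DOTALL)


def extract_scenarios(llm_output):
    """Return the stripped text of every wrapped scenario in the output."""
    return [body.strip() for body in SCENARIO_RE.findall(llm_output)]


def split_queries(scenario):
    """Split the QUERIES block of a templated scenario into single queries.

    Assumes the 'QUERIES' header and 'Query k:' prefixes of the example
    scenario; returns an empty list if no such block is present.
    """
    _, _, tail = scenario.partition("QUERIES")
    parts = re.split(r"Query \d+:", tail)
    return [p.strip() for p in parts if p.strip()]


# Hypothetical raw model output with chatter around the wrapped scenario.
raw = (
    "Sure!\n<START_SCENARIO>\nBACKGROUND\nSome background.\n"
    "QUERIES\nQuery 1: A?\nQuery 2: B?\n<END_SCENARIO>\nDone."
)
scenarios = extract_scenarios(raw)
queries = split_queries(scenarios[0])
print(len(scenarios), queries)
```

A non-greedy match with `re.DOTALL` keeps each scenario separate if the model emits more than one, so a caller can check the "exactly 1 scenario" requirement by testing the length of the returned list.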

These are the motifs used:

P for probability / difficulty: motifs\_P = \[’X consistently wins’, ’X consistently loses’, ’X wins all but one match’, ’X loses all but one match’\]

C for confounded teammates: motifs\_C = \[ ’X always teams up with the same teammate\(s\)’, ’X teams up with different player\(s\) most of the times’ \]

R for round\-robin \(team rotation\): motifs\_R = \[ ’players rotate across teams’, ’players have fixed teams’, ’players generally have fixed teams, except X’ \]
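As an illustration of how these motif lists could fill the `<P>`, `<C>`, `<R>`, and `<N>` slots of the generation prompts, here is a minimal sketch. The motif strings are copied from the lists above. Everything else is an assumption made for illustration: the source does not state how motifs are sampled, so independent uniform sampling, the `fill_motifs` helper, the shortened `TEMPLATE`, and the candidate team sizes are all hypothetical.

```python
import random

# Motif lists as given in the text.
motifs_P = [
    "X consistently wins",
    "X consistently loses",
    "X wins all but one match",
    "X loses all but one match",
]
motifs_C = [
    "X always teams up with the same teammate(s)",
    "X teams up with different player(s) most of the times",
]
motifs_R = [
    "players rotate across teams",
    "players have fixed teams",
    "players generally have fixed teams, except X",
]

# Shortened, hypothetical fragment of the prompt with the same placeholders.
TEMPLATE = (
    "Create scenarios where, for one of the players denoted 'X', "
    "<P>, <C>, and <R>. Choose <N> for team-size."
)


def fill_motifs(template, rng, team_sizes=(1, 2, 3)):
    """Sample one motif per slot (uniformly, an assumption) and substitute."""
    choices = {
        "<P>": rng.choice(motifs_P),
        "<C>": rng.choice(motifs_C),
        "<R>": rng.choice(motifs_R),
        "<N>": str(rng.choice(team_sizes)),  # team sizes are hypothetical
    }
    for slot, value in choices.items():
        template = template.replace(slot, value)
    return template


prompt = fill_motifs(TEMPLATE, random.Random(0))
print(prompt)
```

With four P motifs, two C motifs, and three R motifs there are 24 motif combinations before team size and subdomain are varied.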

### A\.2 Additional experiments

![Refer to caption](https://arxiv.org/html/2606.09856v1/x7.png)

Figure 5: Mean estimates of Jamal’s strength across eight scenarios in which Jamal progressively goes from weaker to stronger\. All methods correctly show a progressive increase in the mean estimate of Jamal’s strength\. However, the base LLM plateaus more than the PPT\-LLM, which still increases by 1\-2% at each step for S7 and S8 and stays closer to the program’s estimates\.

Progressively changing scenarios

We manually edit the scenario one action at a time \(flipping a win or loss, or swapping two players\), creating eight scenarios, S1 to S8, in which the player Jamal progressively goes from weaker to stronger\. All methods correctly show a progressive increase in the mean estimate of Jamal’s strength\. However, the base LLM plateaus more than the PPT\-LLM, which still increases by 1\-2% at each step for S7 and S8 and stays closer to the program’s estimates \(Figure[5](https://arxiv.org/html/2606.09856#A1.F5)\)\.

Posterior sampling and forward sampling

We compare the two approaches we used to generate data for LLM fine\-tuning: posterior sampling and forward sampling\. Posterior sampling yields a distributional target for fine\-tuning, whereas forward sampling, by construction, generates massive amounts of data with only a point estimate available for each query\.
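The difference between the two kinds of target can be seen in a toy model. The sketch below is not the paper's probabilistic programs: it assumes a single latent skill `theta` with a Uniform(0, 1) prior and matches won independently with probability `theta`, and all function names are hypothetical. Posterior sampling conditions on the observed results and yields a histogram over the latent (a soft label), while forward sampling draws the latent first and records it as a single point label for whatever outcomes it generates.

```python
import random
from collections import Counter


def simulate(theta, n_matches, rng):
    """Simulate match outcomes: each match is won with probability theta."""
    return tuple(rng.random() < theta for _ in range(n_matches))


def forward_sample(n_matches, rng):
    """Draw the latent from the prior, then outcomes; the label is one point."""
    theta = rng.random()  # Uniform(0, 1) prior (toy assumption)
    return simulate(theta, n_matches, rng), theta


def posterior_samples(observed, n_keep, rng):
    """Rejection sampling: keep prior draws whose simulated outcomes match."""
    kept = []
    while len(kept) < n_keep:
        theta = rng.random()
        if simulate(theta, len(observed), rng) == observed:
            kept.append(theta)
    return kept


def soft_label(samples, n_bins=10):
    """Normalised histogram of posterior samples over equal-width bins."""
    counts = Counter(min(int(s * n_bins), n_bins - 1) for s in samples)
    return [counts[b] / len(samples) for b in range(n_bins)]


rng = random.Random(0)
observed = (True, True, True)  # the player won all three matches

post = posterior_samples(observed, 2000, rng)
target = soft_label(post)              # distributional target
outcomes, point_label = forward_sample(3, rng)  # one scenario, one number

print([round(p, 2) for p in target])
print(outcomes, round(point_label, 2))
```

In this toy model the exact posterior after three wins is Beta(4, 1) with mean 0.8, so the histogram should concentrate in the upper bins. Rejection sampling is used only because it is easy to read; it becomes impractical as the number of observations grows.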

Table[3\(c\)](https://arxiv.org/html/2606.09856#A1.T3.st3) shows that posterior sampling attains better accuracy, while forward sampling attains comparable\-to\-better raw calibration and negative log\-likelihood\. When temperature scaling \(TS\) is applied, however, posterior sampling gains an edge on these two metrics\.

Table 3: Transfer to external benchmarks on Llama\-3\-8B\. NLL/ECE include a temperature\-scaling \(TS\) block\. Best per column in bold \(within each block\)\. \(a\) NLL across benchmarks \(lower is better\)\. \(b\) Expected Calibration Error \(15\-bin\) \(lower is better\)\. \(c\) Accuracy across benchmarks; OpenEstimate \(OE\) reported as MAE \(lower is better\)\.
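For readers unfamiliar with the metrics in Table 3, the following sketch shows how NLL, 15-bin ECE, and temperature scaling are commonly computed for a classifier's logits. It is a generic illustration, not the paper's evaluation code: the grid search over temperatures stands in for the gradient-based fit that is more usual, the bins are half-open on the right, and the example logits and labels are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature=1.0):
    """Mean negative log-likelihood of the true labels."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def ece(logit_rows, labels, temperature=1.0, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        pred = probs.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(hit for _, hit in b) / len(b)
            total += len(b) / len(labels) * abs(acc - avg_conf)
    return total


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature minimising NLL by grid search (a simplification)."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Made-up overconfident predictions: large logits, 3 of 4 correct.
logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
labels = [0, 0, 1, 1]

T = fit_temperature(logits, labels)
print(T, ece(logits, labels), ece(logits, labels, T))
```

Temperature scaling rescales every confidence with one scalar and leaves the predicted class unchanged, which is why Table 3 reports NLL and ECE both raw and after TS.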
