When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

arXiv cs.LG 06/17/26, 04:00 AM Papers
Summary
This paper proposes a distribution-aware training approach for modeling next-event predictions in concurrent Go programs, treating scheduler nondeterminism as a signal. Fine-tuning a 7B model on fewer than a thousand traces achieves 36.2% accuracy on production bugs, outperforming Gemini 3.5 Flash zero-shot.
arXiv:2606.17508v1 Announce Type: new
# When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs
Source: [https://arxiv.org/html/2606.17508](https://arxiv.org/html/2606.17508)

Kaviru Hapuarachchi

University of Colombo School of Computing, 35 Reid Avenue, Colombo 07, Sri Lanka

Email: 2022is031@stu\.ucsc\.cmb\.ac\.lk

###### Abstract

Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic\. A model trained against a single label is learning to guess one outcome of a random process\. We turn this around and use the nondeterminism as a training signal\. We run each program many times, aggregate the observed next events into an empirical distribution, and fine\-tune a 7B model to match that distribution with a KL objective\. On 798 held\-out predictions drawn from real production Go bugs \(CockroachDB, Kubernetes, gRPC, etcd\), fine\-tuning on fewer than a thousand traces reaches 36\.2% accuracy, ahead of Gemini 3\.5 Flash used zero\-shot \(34\.8%\) and the same model without fine\-tuning \(28\.6%\)\. Distribution training matches cross\-entropy on accuracy \(35\.8% vs\. 36\.2%\) while reducing Expected Calibration Error from 0\.205 to 0\.169\. We also derive a formal goroutine\-leak signature for a class of select\-blocked goroutines where $P(\textsf{GoUnblock})=0$ holds by scheduler semantics, not by learning\. We release the dataset, trained adapters, and all tooling\.

###### keywords:

concurrent execution modeling; nondeterminism; distribution estimation; model calibration; Go; goroutine scheduling; concurrency bugs

## 1 Introduction

A model trained on execution traces learns to predict what a program does, not just what it looks like\. This is the idea behind Code World Models \(CWMs\): trained on runtime state snapshots after each statement, a language model learns the state transition function, which improves verification, debugging, and code generation\[[1](https://arxiv.org/html/2606.17508#bib.bib1)\]\. For sequential Python the approach works well\.

But it rests on a quiet assumption: that execution is deterministic\. Given a state and the next statement, there is one next state\. For concurrent code that assumption fails\. When goroutines run in parallel and communicate over channels, the scheduler interleaves them differently on every run\. The same prefix can be followed by a block, a start, or an unblock, all equally valid\. Train a model to predict a single next event here and you are asking it to memorize one arbitrary outcome of a random process\. It will learn very little\.

This is not a minor technicality\. Concurrency bugs such as deadlocks, data races, and goroutine leaks have no sequential analogue\. They depend on the schedule, evade unit tests, and are hard even for experts to anticipate \[[3](https://arxiv.org/html/2606.17508#bib.bib3),[4](https://arxiv.org/html/2606.17508#bib.bib4)\]\. If execution\-trace models are to be useful where the bugs are hardest, they must handle the nondeterminism that defines the setting\.

We propose treating the nondeterminism as a signal rather than noise\. Run a concurrent program many times and the distribution of next events is not random error\. It is a measurement of which futures the scheduler actually permits\. We reformulate next\-event prediction as distribution estimation: the target for a prefix is the empirical distribution over next events observed across repeated runs, and we train a model to match it \(Fig\.[1](https://arxiv.org/html/2606.17508#S1.F1)\)\.

We make the following contributions:

- We show that a 7B model fine\-tuned on fewer than a thousand concurrent traces reaches 36\.2% next\-event accuracy on held\-out production bugs from CockroachDB, Kubernetes, gRPC, and etcd, beating Gemini 3\.5 Flash zero\-shot \(34\.8%\) and the same model without fine\-tuning \(28\.6%\)\.
- We show that training with a KL objective against empirical distributions matches cross\-entropy on accuracy \(35\.8% vs\. 36\.2%\) while improving calibration: Expected Calibration Error drops from 0\.205 to 0\.169 and model entropy correlates with program nondeterminism\.
- We derive a formal leak signature for select\-blocked goroutines, showing that $P(\textsf{GoUnblock})=0$ at every trace depth follows from Go scheduler semantics, not from learning\.
- We report where the approach falls short: accuracy plateaus near 35–36% regardless of model or objective, rare event types are never learned, and multi\-step predictions lose scheduler coherence after roughly one step\.

We are not claiming a deployable bug detector or a production\-ready execution simulator\. The contribution is a formulation, a dataset, and a set of baselines that show nondeterminism\-aware training works and make clear what would need to change for it to work better\.

\[Figure 1 diagram\. Left: a goroutine\-leak Go snippet in which a goroutine sends 42 on channel ch \(commented “never read”\) and a select then waits on <\-ch and <\-After\(t\)\. Center: program $P$ plus trace prefix $\tau_{1:k}$ feed the CCWM $f(\cdot)$\. Right: prior sequential CWMs emit a single label \(GoBlock $=1.0$\), are trained with cross\-entropy, and assume one valid future; ours emits a distribution $\hat{p}$ over next events, drawn as a bar chart with bar labels 0\.60, 0\.20, 0\.20 and axis ticks Block, Start, Unbl, Sched, End, Creat\.\]

Figure 1: Concurrent execution has many valid next steps, not one\. Prior Code World Models \(blue\) predict a single next event and train against it with cross\-entropy, which works for sequential code but is ill\-defined when the same prefix can legitimately produce several different events\. We predict a full distribution over next events \(orange\), with targets derived from repeated runs of the same program\. The right panel shows an example distribution for a goroutine\-leak program: GoBlock dominates and $P(\textsf{GoUnblock})=0$, a signature that follows from the scheduler, not from learning\.
## 2 Background and Related Work

Code world models and execution traces\. A world model learns an environment’s transition function: state and action in, next state out \[[1](https://arxiv.org/html/2606.17508#bib.bib1)\]\. CWMs apply this to program execution, training on action\-state pairs from interpreter traces so the model predicts the next action and resulting state from a partial trace \[[1](https://arxiv.org/html/2606.17508#bib.bib1)\]\. Meta’s 32B CWM, trained on Python traces and agentic trajectories, showed that this grounding improves coding and reasoning \[[1](https://arxiv.org/html/2606.17508#bib.bib1)\]\. A follow\-up study of CWM failure modes found that errors concentrate in two regimes: token\-budget exhaustion on long traces and string\-valued state confused by subword tokenization \[[2](https://arxiv.org/html/2606.17508#bib.bib2)\]\. Both studies assume deterministic, sequential execution\.

Concurrency and LLMs\. Concurrent code introduces nondeterministic scheduling and bug classes \(races, deadlocks, starvation\) that have no sequential equivalent\. Unit\-test evaluation cannot systematically explore thread schedules \[[3](https://arxiv.org/html/2606.17508#bib.bib3)\]\. The CONCUR benchmark targets concurrent code generation, judged by model checking \[[3](https://arxiv.org/html/2606.17508#bib.bib3)\]\. Our problem is different: modeling the execution of concurrent programs as a learned transition function, using scheduling nondeterminism as a distributional target\. The GoKer/GoBench corpus of real Go concurrency bugs \[[4](https://arxiv.org/html/2606.17508#bib.bib4)\] provides our held\-out test programs\.

Distribution matching and calibration\. Training against a target distribution rather than a point label is an established technique: soft\-target objectives improve language\-model calibration, often using an empirical distribution from a few hundred samples as the target \[[5](https://arxiv.org/html/2606.17508#bib.bib5),[6](https://arxiv.org/html/2606.17508#bib.bib6)\]\. Our contribution is not distribution matching itself but its source\. The nondeterminism of concurrent execution, observed through repeated runs, gives a principled empirical target over next events\. We are not aware of prior work deriving distributional training targets from concurrent execution traces\.

Table [1](https://arxiv.org/html/2606.17508#S2.T1) positions our work against the two closest lines of prior research\.

Table 1: Comparison against prior execution\-trace modeling and the closest concurrency benchmark across dimensions relevant to this problem\. This work is the first to combine execution\-trace modeling with concurrent programs and nondeterministic distributional targets\.

## 3 Problem Formulation

We consider Go programs that spawn multiple goroutines\. The runtime tracer emits scheduler events and we work with six event types covering goroutine lifecycle and synchronization:

$$\mathcal{E}=\{\textsf{GoBlock},\ \textsf{GoCreate},\ \textsf{GoEnd},\ \textsf{GoSched},\ \textsf{GoStart},\ \textsf{GoUnblock}\}.$$

A trace is a sequence of snapshots $\tau=(s_{1},\dots,s_{n})$, each recording the triggering event type, the goroutine id, and per\-goroutine status \(running / runnable / blocked / dead\)\. Given a prefix $\tau_{1:k}$ and program source $P$, the task is to predict the event type of $s_{k+1}$\.

From label to distribution\. Standard CWM training takes the next event as one label and minimizes cross\-entropy\. Because execution is nondeterministic, repeated runs of $P$ from comparable prefixes produce different valid next events\. For a group $g$ of runs sharing a program and prefix depth, let $c_{g}(e)$ count next events of type $e$\. The empirical distribution and its Dirichlet posterior \(Jeffreys prior $\alpha=0.5$\) are

$$\hat{p}_{g}(e)=\frac{c_{g}(e)}{\sum_{e'}c_{g}(e')},\qquad\tilde{\alpha}_{g}(e)=\alpha+c_{g}(e).\tag{1}$$

The learning target is $\hat{p}_{g}$, not a single label\. We approximate “same prefix” by matching split depths \(25/50/75% of the trace\)\. Because interleavings differ across runs, grouped prefixes are structurally comparable but not byte\-identical\. We revisit this approximation in Section [9](https://arxiv.org/html/2606.17508#S9)\.
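As a concrete reading of Eq. (1), the following Go sketch computes the empirical distribution and the Dirichlet posterior parameters for one group of runs. The function names (`Empirical`, `Posterior`) and the example counts are illustrative assumptions, not taken from the released tooling.

```go
package main

import "fmt"

// EventTypes lists the six scheduler event types from the event set E.
var EventTypes = []string{"GoBlock", "GoCreate", "GoEnd", "GoSched", "GoStart", "GoUnblock"}

// Empirical returns p_hat_g(e) = c_g(e) / sum over e' of c_g(e') for one group.
// It returns nil when the group has no observed next events.
func Empirical(counts map[string]int) map[string]float64 {
	total := 0
	for _, e := range EventTypes {
		total += counts[e]
	}
	if total == 0 {
		return nil
	}
	p := make(map[string]float64, len(EventTypes))
	for _, e := range EventTypes {
		p[e] = float64(counts[e]) / float64(total)
	}
	return p
}

// Posterior returns the Dirichlet parameters alpha_tilde_g(e) = alpha + c_g(e).
func Posterior(counts map[string]int, alpha float64) map[string]float64 {
	post := make(map[string]float64, len(EventTypes))
	for _, e := range EventTypes {
		post[e] = alpha + float64(counts[e])
	}
	return post
}

func main() {
	// Hypothetical counts from five runs of one program at one split depth.
	counts := map[string]int{"GoBlock": 3, "GoStart": 1, "GoUnblock": 1}
	p := Empirical(counts)
	post := Posterior(counts, 0.5) // Jeffreys prior, alpha = 0.5
	for _, e := range EventTypes {
		fmt.Printf("%-10s p=%.2f alpha=%.1f\n", e, p[e], post[e])
	}
}
```

With only five runs per program, the posterior parameters make explicit how little evidence sits behind a zero count: an unobserved event keeps the prior mass of 0.5 rather than being ruled out.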

## 4 Dataset and Evaluation Setup

Programs\. We assembled 130 concurrent Go programs in three groups \(Table [2](https://arxiv.org/html/2606.17508#S4.T2)\)\. Hand\-crafted programs cover channel, mutex, select, pipeline, waitgroup, and fan\-in/out patterns, with intentional deadlocks, races, and leaks\. Generated programs are synthesized with randomized parameters and optional injected bugs, each verified to compile\. Real\-world programs are reduced concurrency\-bug kernels from the GoKer/GoBench corpus \[[4](https://arxiv.org/html/2606.17508#bib.bib4)\], drawn from production systems including CockroachDB, Kubernetes, gRPC, etcd, Istio, and Moby\. Each program carries metadata describing outcome, pattern, goroutine count, and expected nondeterminism\.

Table 2: Program corpus\. All 66 real\-world GoKer programs are held out from training, so accuracy on those programs measures out\-of\-distribution generalization to code the model never saw during fine\-tuning\.

Traces\. Each program is compiled with the race detector and run five times under the runtime tracer\. Differing interleavings across runs are the variability we exploit\. From each run we form examples at 25/50/75% prefix depth\. Deadlocking programs emit a no\-next\-event marker and are excluded from distribution aggregation\. The tracer exposes scheduler events but not channel buffers, mutex holders, or local variables, so the model must reason from scheduling behavior and program structure\.
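The prefix-depth construction can be sketched as follows. This is a minimal illustration assuming the cut point is the trace length times the depth, rounded down; the paper does not state the exact rounding rule, and `Example` and `SplitAtDepths` are hypothetical names.

```go
package main

import "fmt"

// Example pairs a trace prefix with the event type that followed it.
type Example struct {
	Depth  float64  // fraction of the trace used as the prefix
	Prefix []string // event types seen so far
	Next   string   // event type to predict
}

// SplitAtDepths cuts one trace at each fractional depth and returns
// (prefix, next event) pairs. Cut points that leave an empty prefix or
// no next event are skipped.
func SplitAtDepths(events []string, depths []float64) []Example {
	var out []Example
	for _, d := range depths {
		k := int(d * float64(len(events))) // prefix length, rounded down (assumed rule)
		if k < 1 || k >= len(events) {
			continue
		}
		out = append(out, Example{Depth: d, Prefix: events[:k], Next: events[k]})
	}
	return out
}

func main() {
	// Hypothetical eight-event trace, reduced to event types only.
	trace := []string{"GoCreate", "GoStart", "GoBlock", "GoStart", "GoUnblock", "GoStart", "GoBlock", "GoEnd"}
	for _, ex := range SplitAtDepths(trace, []float64{0.25, 0.50, 0.75}) {
		fmt.Printf("depth %.2f: %d-event prefix, next = %s\n", ex.Depth, len(ex.Prefix), ex.Next)
	}
}
```

Grouping the `Next` values of all runs of one program at the same depth then yields the counts that feed the empirical distribution.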

Splits\. We hold out all 66 real\-world programs for evaluation and train only on hand\-crafted and generated programs\. This yields 945 training examples and 798 held\-out next\-event predictions, plus 75 aggregated empirical\-distribution groups\.

## 5 Method

Format\. Each example renders the program source, the partial trace as JSON, and the current goroutine states, then asks for the next event\. The point target is a JSON object \{"event\_type":\.\.\., "goroutine\_id":\.\.\.\}; the distribution target is the six\-way vector $\hat{p}_{g}$\. Prompts are left\-truncated at the source so the target is never cut off\.

Cross\-entropy baseline\. We fine\-tune Qwen2\.5\-Coder \(1\.5B and 7B\) with cross\-entropy over response tokens using 4\-bit QLoRA \(rank 16, $\alpha=32$, gradient checkpointing\)\. To approximate the empirical distribution under a point loss, examples are duplicated in proportion to observed next\-event frequency\.
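One way to read the duplication step is to emit one point-label target per observed next event, so that average cross-entropy over the copies equals cross-entropy against the empirical distribution. The sketch below assumes that reading; the exact replication scheme is not specified in the text, and `Replicate` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// Replicate emits one point-label training target per observed next event,
// so label frequencies under cross-entropy mirror the empirical counts.
func Replicate(counts map[string]int) []string {
	events := make([]string, 0, len(counts))
	for e := range counts {
		events = append(events, e)
	}
	sort.Strings(events) // deterministic order for reproducibility
	var labels []string
	for _, e := range events {
		for i := 0; i < counts[e]; i++ {
			labels = append(labels, e)
		}
	}
	return labels
}

func main() {
	// Hypothetical counts from five runs at one split depth.
	labels := Replicate(map[string]int{"GoBlock": 3, "GoStart": 1, "GoUnblock": 1})
	fmt.Println(labels)
}
```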

KL distribution loss\. The distribution objective adds a KL term at the token position that discriminates between event types\. With logits $z$ restricted to the six event\-type tokens, $q=\mathrm{softmax}(z)$, and empirical target $\hat{p}_{g}$,

$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\lambda\,\mathrm{KL}(\hat{p}_{g}\,\|\,q),\tag{2}$$

with $\lambda=0$ recovering the CE ablation\. Because all six event types share the leading subword “Go,” we place the KL term at the second, discriminating token\. The 7B QLoRA configuration is identical to the CE baseline so that any difference is attributable to the objective alone\.
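To make Eq. (2) concrete, here is a small numeric sketch of the loss for a single example. It works on plain float slices rather than a training framework, and the target, logits, and cross-entropy value are made-up numbers for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Softmax converts logits into probabilities, subtracting the max for stability.
func Softmax(z []float64) []float64 {
	maxZ := math.Inf(-1)
	for _, v := range z {
		maxZ = math.Max(maxZ, v)
	}
	q := make([]float64, len(z))
	sum := 0.0
	for i, v := range z {
		q[i] = math.Exp(v - maxZ)
		sum += q[i]
	}
	for i := range q {
		q[i] /= sum
	}
	return q
}

// KL computes KL(p || q) in nats, using the convention 0 * log(0/q) = 0.
func KL(p, q []float64) float64 {
	kl := 0.0
	for i := range p {
		if p[i] > 0 {
			kl += p[i] * math.Log(p[i]/q[i])
		}
	}
	return kl
}

// Loss combines a precomputed cross-entropy term with the weighted KL term.
func Loss(ce, lambda float64, target, logits []float64) float64 {
	return ce + lambda*KL(target, Softmax(logits))
}

func main() {
	// Order: GoBlock, GoCreate, GoEnd, GoSched, GoStart, GoUnblock.
	target := []float64{0.6, 0, 0, 0, 0.2, 0.2} // hypothetical empirical target
	logits := []float64{2.0, -1.0, -1.0, -1.0, 0.5, 0.5}
	fmt.Printf("KL = %.4f nats\n", KL(target, Softmax(logits)))
	fmt.Printf("loss (lambda=0) = %.4f\n", Loss(1.25, 0, target, logits))
}
```

Because the direction is $\mathrm{KL}(\hat{p}_{g}\,\|\,q)$, events with zero empirical mass contribute nothing to the sum directly; they are penalized only through the softmax normalization, which takes probability away from the observed events.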

Multi\-step coherence probe\. To measure how far single\-step prediction extends before it breaks, we feed the model’s own predictions back as input: from a real prefix, predict the next event, append it, and repeat for up to 15 steps\. A symbolic scheduler finite\-state machine checks each predicted transition against Go invariants \(for example, a blocked goroutine cannot be started\)\. We report survival steps, meaning valid transitions before the first violation, along with the per\-step event distribution and entropy\. We use this as a diagnostic tool in Section [7](https://arxiv.org/html/2606.17508#S7) to locate the boundary between what the model can and cannot do\.
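A finite-state check of this kind might look like the sketch below. The transition table is an assumed simplification (for instance, it treats the goroutine named in a GoCreate event as the newly created one), and the names `Apply` and `SurvivalSteps` are hypothetical; the released tooling may encode different invariants.

```go
package main

import "fmt"

// Status is the per-goroutine state recorded in each trace snapshot.
type Status int

const (
	Runnable Status = iota
	Running
	Blocked
	Dead
)

// Event is one predicted scheduler event applied to goroutine G.
type Event struct {
	Type string
	G    int
}

// Apply checks one transition and updates state; it reports false on a violation.
// The transition table is an assumed simplification of Go scheduler rules.
func Apply(state map[int]Status, ev Event) bool {
	cur, known := state[ev.G]
	switch ev.Type {
	case "GoCreate": // a new goroutine starts out runnable
		if known {
			return false
		}
		state[ev.G] = Runnable
		return true
	case "GoStart":
		return move(state, ev.G, known && cur == Runnable, Running)
	case "GoBlock":
		return move(state, ev.G, known && cur == Running, Blocked)
	case "GoSched":
		return move(state, ev.G, known && cur == Running, Runnable)
	case "GoEnd":
		return move(state, ev.G, known && cur == Running, Dead)
	case "GoUnblock":
		return move(state, ev.G, known && cur == Blocked, Runnable)
	}
	return false // unknown event type
}

// move applies the state change only when the precondition holds.
func move(state map[int]Status, g int, ok bool, next Status) bool {
	if ok {
		state[g] = next
	}
	return ok
}

// SurvivalSteps counts valid transitions before the first violation.
func SurvivalSteps(state map[int]Status, predicted []Event) int {
	for i, ev := range predicted {
		if !Apply(state, ev) {
			return i
		}
	}
	return len(predicted)
}

func main() {
	state := map[int]Status{1: Running, 2: Blocked}
	rollout := []Event{{"GoBlock", 1}, {"GoStart", 2}, {"GoUnblock", 2}}
	// The second step tries to start a blocked goroutine, which is invalid.
	fmt.Println("survival steps:", SurvivalSteps(state, rollout))
}
```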

## 6 Results

Table 3: Next\-event accuracy on the held\-out real\-world GoKer set\. Fine\-tuning on concurrent traces from hand\-crafted programs generalizes to real production bugs, outperforming both the zero\-shot baseline and Gemini 3\.5 Flash\. Distribution training with KL loss achieves comparable accuracy to cross\-entropy while improving calibration \(Section [7](https://arxiv.org/html/2606.17508#S7)\)\.

Table 4: In\-distribution accuracy on hand\-crafted programs, included for context\. These numbers are not directly comparable to Table [3](https://arxiv.org/html/2606.17508#S6.T3) because the model used here is an earlier Gemini generation evaluated with a different prompt format\.

Fine\-tuning generalizes out\-of\-distribution\. Table [3](https://arxiv.org/html/2606.17508#S6.T3) shows the main result\. Trained on 945 hand\-crafted and generated traces, a 7B model reaches 36\.2% on real\-world GoKer bugs it never saw during training\. This beats the same model zero\-shot \(28\.6%\) and Gemini 3\.5 Flash used zero\-shot \(34\.8–35\.2%\)\. Small\-scale concurrent\-trace supervision transfers to structurally different real code\.

Distribution training matches accuracy\. KL training reaches 35\.8%, which is not meaningfully different from CE’s 36\.2%\. We view this as the expected outcome: distribution training should not sacrifice accuracy, and the benefit appears in calibration rather than top\-1 prediction \(Section [7](https://arxiv.org/html/2606.17508#S7)\)\.

An accuracy ceiling near 35–36%\. Three approaches, CE fine\-tuning, KL fine\-tuning, and a strong zero\-shot model, all land within one percentage point of each other\. That clustering is itself informative: predicting the next scheduler event of a real concurrency bug from traces alone is hard, and neither scale nor objective moves the ceiling\. Section [7](https://arxiv.org/html/2606.17508#S7) traces this to rare events and distribution shift\.

Reasoning does not help\. Enabling thinking in Gemini 3\.5 Flash slightly lowers accuracy \(34\.8% vs\. 35\.2%\)\. The relevant signal here is structural, not multi\-step deductive\.

## 7 Analysis: Where the Ceiling Comes From

Per\-event accuracy\. Table [5](https://arxiv.org/html/2606.17508#S7.T5) breaks accuracy down by event type\. The model learns common lifecycle events \(GoStart at 47%, GoCreate at 44%, GoBlock at 36%\) but never predicts GoEnd or GoSched \(both 0%\) and rarely gets GoUnblock right \(8%\)\.

Table 5: Per\-event accuracy of the KL model on GoKer, alongside train and test frequency\. Two events with near\-zero training frequency, GoEnd and GoSched, are never predicted correctly\. Meanwhile GoCreate achieves 44% accuracy from only 1\.5% of training examples, suggesting the model reasons from program structure rather than label frequency\.

Two effects compound to produce the ceiling\. First, class imbalance: GoEnd \(2%\) and GoSched \(0\.4%\) are so rare in training that the model never emits them, and KL does not fix this because the empirical targets are themselves skewed toward common events\. Second, distribution shift: real GoKer programs exercise GoCreate \(21% of test\) and GoSched \(7%\) far more than the hand\-crafted training set does\. The GoCreate result is worth noting: 44% accuracy from only 1\.5% of training examples means the model is picking up structural cues from the program text, not just frequency patterns\.

Calibration and uncertainty\. Beyond top\-1 accuracy, distribution framing improves how well predicted probabilities track reality\. In a prompting study over the aggregated groups, predicting a distribution with a reasoning budget lowers Expected Calibration Error from 0\.205 \(point\-prediction baseline\) to 0\.169, and model entropy correlates with program nondeterminism \(Spearman $\rho=0.412$, $p=0.007$\)\. In the multi\-step probe, the KL model assigns higher entropy to harder programs: 0\.945 bits on leak programs vs\. 0\.773 on race programs\. This is uncertainty the cross\-entropy model cannot express by design\.
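For reference, the two quantities in this paragraph can be computed as in the sketch below. It uses the common equal-width binning definition of Expected Calibration Error, which is an assumption here since the paper does not state its bin count or binning rule, and the input values are made up.

```go
package main

import (
	"fmt"
	"math"
)

// ECE computes Expected Calibration Error with equal-width confidence bins:
// the weighted mean of |accuracy - mean confidence| over non-empty bins.
// Confidences are assumed to lie in [0, 1].
func ECE(conf []float64, correct []bool, bins int) float64 {
	sumConf := make([]float64, bins)
	hits := make([]float64, bins)
	count := make([]float64, bins)
	for i, c := range conf {
		b := int(c * float64(bins))
		if b >= bins { // confidence of exactly 1.0 goes in the last bin
			b = bins - 1
		}
		sumConf[b] += c
		count[b]++
		if correct[i] {
			hits[b]++
		}
	}
	ece := 0.0
	for b := 0; b < bins; b++ {
		if count[b] == 0 {
			continue
		}
		gap := math.Abs(hits[b]/count[b] - sumConf[b]/count[b])
		ece += count[b] / float64(len(conf)) * gap
	}
	return ece
}

// EntropyBits returns the Shannon entropy of p in bits.
func EntropyBits(p []float64) float64 {
	h := 0.0
	for _, v := range p {
		if v > 0 {
			h -= v * math.Log2(v)
		}
	}
	return h
}

func main() {
	// Hypothetical top-1 confidences and correctness flags.
	conf := []float64{0.9, 0.8, 0.6, 0.55}
	correct := []bool{true, false, true, false}
	fmt.Printf("ECE = %.3f\n", ECE(conf, correct, 10))
	fmt.Printf("H = %.3f bits\n", EntropyBits([]float64{0.6, 0.2, 0.2}))
}
```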

The single\-step boundary\. Table [6](https://arxiv.org/html/2606.17508#S7.T6) summarizes the 15\-step coherence probe on 54 GoKer programs\. The first predicted step reflects real execution semantics: on leak programs, predictions are GoBlock\-dominated with $P(\textsf{GoUnblock})$ near zero, matching the signature described in Section [8](https://arxiv.org/html/2606.17508#S8)\. However, mean survival is roughly one step before the model produces a scheduler\-invalid transition\. This was expected: the model was trained only on single steps, so using its predictions autoregressively exposes the boundary quickly\. The probe is useful because it locates that boundary precisely rather than leaving it as an open question\.

Table 6: Multi\-step coherence probe on GoKer \(15 steps, 3 samples per program\)\. The first predicted step reflects the leak signature, but mean survival drops to roughly one step before the model produces invalid transitions, confirming that single\-step training does not transfer to multi\-step simulation\.

## 8 The Select\-Block Leak Signature

One distributional pattern in our data has a formal explanation rather than a statistical one\. In a subclass of goroutine leaks, we consistently observe $P(\textsf{GoUnblock})=0$ at every trace depth across all five runs of those programs\.

Consider a goroutine $G$ that enters a select statement at time $t$, where none of the case conditions can ever be satisfied by the remaining execution\. We call this a *select\-block leak*\. For such $G$, a GoUnblock event is impossible for all $t'>t$, and so $P(\textsf{GoUnblock})=0$ in the empirical next\-event distribution for any split taken after $G$’s first GoBlock\.

The reason is straightforward\. In Go, GoUnblock fires only when another goroutine sends to or closes a channel that $G$ is waiting on, or releases a mutex that $G$ is waiting to acquire\. If no reachable goroutine in the remaining execution can do either, then GoUnblock never fires\. This is a consequence of the scheduler’s semantics, not something the model learns\. It holds regardless of how many select cases $G$ has; we verified this for both two\-case and four\-case selects\.

The signature is mechanism\-specific and we are careful not to overstate it\. Only leaks where the goroutine reaches a permanently blocked select *before* any GoUnblock event appears in the trace show the all\-zero pattern\. Leaks that do legitimate work first, for example receiving items from a channel before blocking, show $P(\textsf{GoUnblock})>0$ at early split depths\. We tried using this signal as an unsupervised anomaly detector and found it separates buggy from clean programs only weakly at our scale \(Cohen’s $d=0.29$, not significant at $n=9$\)\. We present it as a precise characterization of one leak class, not as a general\-purpose detector\.

## 9 Threats to Validity

Our grouping of runs by split depth is an approximation\. We treat runs at the same depth as sharing a prefix family, but because interleavings differ, the actual prefixes are structurally comparable rather than identical\. This can blur the empirical distribution targets, particularly for programs with high nondeterminism where many interleavings are equally likely\.

The runtime tracer does not expose channel buffer occupancy, mutex ownership, or local variable values\. The model reasons from a partial view of program state, which bounds achievable accuracy and explains some of the confusion in event types like GoUnblock that depend on synchronization state the model cannot see\.

Several of our secondary analyses rest on small samples\. The calibration correlation \($\rho=0.412$, $p=0.007$\) is statistically significant, but the anomaly detection result \($n=9$, Cohen’s $d=0.29$\) is not\. We report effect sizes throughout rather than drawing conclusions from isolated $p$\-values\.

The in\-distribution Gemini result in Table [4](https://arxiv.org/html/2606.17508#S6.T4) used an earlier model generation and a different prompt format\. We include it to show the relative improvement from fine\-tuning, but we do not compare it numerically to the out\-of\-distribution results in Table [3](https://arxiv.org/html/2606.17508#S6.T3)\.

## 10 Conclusion

We reformulated next\-event prediction for concurrent programs as distribution estimation over empirical nondeterministic targets\. A 7B model fine\-tuned on fewer than a thousand Go traces generalizes to real production concurrency bugs better than a strong zero\-shot large model, and distribution training with a KL objective matches cross\-entropy accuracy while improving calibration\. The 35–36% accuracy ceiling traces to rare\-event failure and train\-test distribution shift, and multi\-step predictions lose coherence after roughly one step because the model was never trained on trajectories\.

Three directions follow directly from these findings\. Training on trajectories rather than individual steps is the most direct path to extending multi\-step coherence\. Encoding channel and mutex state in the trace representation would let the model reason about synchronization, not just goroutine lifecycle events\. Rebalancing the training set toward the event mix of real Go code would address the class\-imbalance ceiling that currently prevents the model from learning rare but important event types\. We release the dataset, the cross\-entropy and KL adapters, and all tooling to support this line of work\.

## Appendix: Artifacts and Compute

All artifacts are publicly released\. Code, tooling, and scripts are available at \[[7](https://arxiv.org/html/2606.17508#bib.bib7)\]\. The benchmark dataset is released as \[[8](https://arxiv.org/html/2606.17508#bib.bib8)\]\. The cross\-entropy and KL fine\-tuned 7B adapters are released as \[[9](https://arxiv.org/html/2606.17508#bib.bib9)\] and \[[10](https://arxiv.org/html/2606.17508#bib.bib10)\] respectively; both are 4\-bit QLoRA adapters runnable on a single 20 GB GPU\.

Total compute cost was $12 for fine\-tuning on a RunPod RTX 4000 Ada and $60 for Gemini API inference across all zero\-shot evaluations\.

## References

- \[1\] FAIR CodeGen Team: CWM: An Open\-Weights LLM for Research on Code Generation with World Models\. arXiv:2510\.02387 \(2025\)
- \[2\] Rahmani, B\.: Debugging Code World Models\. arXiv:2602\.07672 \(2026\)
- \[3\] Huang, J\., Mahmud, T\., Păsăreanu, C\., Yang, G\.: CONCUR: Benchmarking LLMs for Concurrent Code Generation\. arXiv:2603\.03683 \(2026\)
- \[4\] Tu, T\., Liu, X\., Song, L\., Zhang, Y\.: Understanding Real\-World Concurrency Bugs in Go\. In: ASPLOS \(2019\)
- \[5\] Zhang, Y\., Schwarzschild, A\., Carlini, N\., Kolter, Z\., Ippolito, D\.: Forcing Diffuse Distributions out of Language Models\. arXiv:2404\.10859 \(2024\)
- \[6\] Baldelli, D\., Kuriakose, S\., Hashemzadeh, M\., Zouaq, A\., Chandar, S\.: Probabilistic Calibration Is a Trainable Capability in Language Models\. arXiv:2605\.11845 \(2026\)
- \[7\] Hapuarachchi, K\.: Weave: Concurrent Code World Models – Code, Tooling, and Evaluation Scripts\. https://github\.com/kaviru2/weave \(2026\)
- \[8\] Hapuarachchi, K\.: weave\-bench: Concurrent Go Execution Trace Benchmark Dataset\. HuggingFace Datasets\. https://huggingface\.co/datasets/kavirubc/weave\-bench \(2026\)
- \[9\] Hapuarachchi, K\.: weave\-ccwm\-qwen2\.5\-coder\-7b\-lora: Cross\-Entropy Fine\-Tuned Adapter\. HuggingFace\. https://huggingface\.co/kavirubc/weave\-ccwm\-qwen2\.5\-coder\-7b\-lora \(2026\)
- \[10\] Hapuarachchi, K\.: weave\-ccwm\-qwen2\.5\-coder\-7b\-kl\-lora: KL Distribution Fine\-Tuned Adapter\. HuggingFace\. https://huggingface\.co/kavirubc/weave\-ccwm\-qwen2\.5\-coder\-7b\-kl\-lora \(2026\)