Thermocompute constant time inference [P]
Summary
Thermocompute is a PyTorch emulator for thermodynamic probabilistic computing. It lets neural network layers run inference in constant modeled physical time by mapping layer width onto a parallel thermodynamic substrate, and its stochastic layers are usable on GPUs today.
arccoxx/thermocompute
Source: https://github.com/arccoxx/thermocompute
thermocompute
thermocompute is a PyTorch-first emulator for thermodynamic probabilistic
computing. It provides software models of p-bits/PDITs, PMODE Gaussian
samplers, PMOG mixture samplers, generic torch.distributions probability
families, low-precision quantization-aware FFNs, flow-matching generative
models, Gaussian-process layers, thermodynamic convolutional layers, and
fixed-observation-time thermodynamic neuron layers.
The punchline: thermocompute lets us study neural layers where width can
behave more like parallel fabric than sequential latency. Today, that matters
in two ways:
- GPU-only path: use thermodynamic layers directly as PyTorch modules. In current CUDA experiments, wall-clock inference often enters a useful plateau across moderate width ranges because the stochastic width dimension is vectorized and the GPU still has parallel headroom.
- Sim-to-real path: the same code models future thermodynamic hardware, where increasing width adds parallel physical units and does not lengthen the fixed observation window.
Under the massively parallel thermodynamic hardware model, we can run variable-width fixed-depth neural inference in constant modeled physical time. Increasing width adds parallel thermodynamic units; it does not lengthen the fixed physical evolution window.
That is the central research result: a wider thermodynamic layer can expose more stochastic nonlinear features while reporting the same physical inference time. Classical dense digital layers must touch width-dependent weights and activations. The thermodynamic substrate model turns that width term into parallel physical fabric.
The flagship demonstration is scripts/run_superiority_demo.py: it sweeps
thermodynamic transformer FFN width up to 4096 units, projects the same model
to much larger widths, and shows a flat modeled physical-time line while the
dense digital FFN work proxy climbs with width. This is the cleanest current
package result: variable-width neural inference in constant modeled physical
time under the parallel thermodynamic substrate model.
The practical near-term bet is GPU-only thermodynamic inference. The emulator is not only a future-hardware simulator; it is also a usable stochastic neural layer family for CUDA today. If your GPU has enough unused parallel capacity, widening the thermodynamic hidden array can be much cheaper in wall-clock time than widening a classical dense digital FFN. That plateau is empirical, not a guarantee, and the benchmark artifacts report wall-clock timing separately so you can find the saturation point on your own hardware.
The second bet is sim-to-real. The same APIs let you prototype algorithms as physical stochastic processes before dedicated thermodynamic hardware is available, then compare modeled physical time against conventional sequential sampling or digital neural computation.
This is not a claim that PyTorch magically has constant wall-clock time for all sizes. The emulator runs on CPU or GPU and pays normal software costs. The important empirical claim is narrower and more useful: on CUDA, there can be a substantial width range where thermodynamic emulator wall time is roughly flat because the GPU is executing the stochastic width in parallel. Eventually memory bandwidth, occupancy, register pressure, launch overhead, or sheer tensor size will dominate, and wall time will rise.
The most important distinction in the package is:
PyTorch wall time = how long this emulator takes on today's CPU/GPU
CUDA plateau = observed wall-clock regime before GPU saturation
modeled physical time = how long the corresponding thermodynamic substrate runs
The first one is a practical software result. The second one is the GPU-only opportunity. The third one is the hardware-theory anchor.
License And Commercial Use
Current versions of thermocompute are source-available under the
PolyForm Noncommercial License 1.0.0. The code is public for research,
inspection, experimentation, education, and other noncommercial use.
Commercial use requires a separate written commercial license from the project owner. See COMMERCIAL.md.
Contributions are welcome only under terms that preserve this dual-license option. See CONTRIBUTING.md.
This repository was previously published under MIT. Those older versions keep the terms they were already released under, but future versions are distributed under the current noncommercial license unless the project owner grants different terms. The project owner may relicense future versions or grant commercial licenses at their discretion.
Use Cases
1. GPU-Only Thermodynamic Layers Today
This is the path to prioritize now. thermocompute gives you PyTorch modules
that can be dropped into real experiments:
- ThermodynamicFFN: [batch, seq, embed] -> [batch, seq, embed]
- ThermodynamicTransformerBlock: pre-norm attention plus a thermodynamic FFN
- ThermodynamicTransformerLayer: research layer with sampled PDIT attention and thermodynamic feed-forward width
- ThermodynamicConv2d: image/local-feature layer with a thermodynamic hidden channel fabric
- ThermodynamicCNNClassifier: tiny classifier wrapper for lightweight CNN experiments
- ExactGaussianProcessRegressor: exact small-data GP posterior inference
- RandomFeatureGaussianProcessLayer: scalable GP-style layer with fast ridge readout fitting
- ThermodynamicMLP: fixed-time stochastic MLPs for non-sequence experiments
The goal is not to beat every optimized dense kernel on every shape. The goal is to exploit GPU parallelism to emulate very wide stochastic thermodynamic feature arrays where increasing width may be much cheaper than increasing classical dense FFN width. That makes the package useful for:
- wide stochastic feature maps
- reservoir/readout experiments
- uncertainty-aware transformer blocks
- local image and feature-map experiments
- Gaussian-process regression and random-feature Bayesian layers
- synthetic sequence modeling
- energy-based or diffusion-like latent transformations
- flow-matching generative sampling with fewer neural evaluations
- fast ablations before thermodynamic hardware exists
The default engineering path uses no replicas and no tempering:
n_replicas = 1
tempering = False
This is the fastest and most memory-efficient mode. Parallel tempering remains available, but it is an optional exploration method for rugged objectives. In our current small end-to-end experiment it added cost and provided little benefit, so the package should be evaluated first in the no-replica setting.
Use the benchmark scripts to measure your own plateau:
python scripts/run_benchmarks.py --outdir artifacts
python scripts/run_superiority_demo.py --outdir artifacts
Read the wall_ms_median fields as GPU software measurements and
physical_time fields as modeled substrate measurements.
2. Sim-To-Real Thermodynamic Hardware
The same layer definitions are also a clean hardware target. A thermodynamic substrate would replace the numerical Langevin loop with physical evolution and readout. The API already exposes the hardware-relevant knobs:
- t_f: fixed observation window
- dt: emulator integration step
- temperature: noise strength
- j2, j3, j4: quartic potential shape
- n_replicas, tempering, swap_interval: replica and tempering schedule
The no-replica path maps to one thermodynamic array and one readout. Replica ladders map to additional arrays and should be treated as an expensive capability, not the default substrate assumption.
That makes thermocompute a bridge: run GPU-only experiments now, then map
successful thermodynamic blocks to future physical arrays.
3. Research Proofs And Claim Boundaries
The package ships benchmark artifacts and JSON schemas so results are inspectable. The core proof is modeled physical-time scaling. The promising engineering observation is the CUDA wall-clock plateau. They are related, but they are not the same claim.
Motivation
Modern probabilistic inference usually relies on digital iteration: Gibbs sampling, Langevin updates, diffusion steps, MCMC sweeps, or repeated neural network evaluations. Those methods are powerful, but their runtime grows with the number of variables, the number of sampling steps, and often the number of model layers.
Thermodynamic computing starts from a different premise. Instead of simulating randomness digitally, the hardware uses physical thermal noise as the sampling resource. Local stochastic units relax under programmable biases, couplings, temperatures, and nonlinear potentials. A full array of units evolves simultaneously for a chosen physical window, then the final voltages or states are read out.
thermocompute is built around that idea:
- Use PyTorch tensors to emulate many thermodynamic units in parallel.
- Treat CUDA as a practical first deployment target, not just a simulator.
- Make physical time explicit through tau0, dt, and t_f.
- Support discrete, continuous, and multimodal probabilistic primitives.
- Build thermodynamic neural layers whose nonlinear activation is produced by fixed-time stochastic dynamics instead of a hand-coded activation function.
- Measure whether GPU wall time enters a useful plateau as thermodynamic width grows.
- Demonstrate the key scaling idea: variable-width neural inference can keep constant modeled physical time when the extra width is mapped to parallel thermodynamic units.
- Run experiments that separate emulator wall time from modeled physical time.
The library is intentionally minimal. It does not try to be a full model zoo or a hardware driver. It focuses on the smallest set of abstractions needed to ask the research question cleanly:
Can we increase neural width and representational capacity without increasing
modeled physical inference latency?
For the thermodynamic neuron and transformer FFN layers in this release, the answer is yes under the parallel-substrate model. The benchmarks make that claim measurable and keep it separate from ordinary PyTorch runtime.
Installation
From this repository:
python -m pip install -e .
For local tests without installing pytest:
python scripts/run_tests.py
If pytest is installed:
python -m pytest
Quick Start
import torch

from thermocompute import (
    PMODE,
    PMOG,
    ThermodynamicNeuronConfig,
    ThermodynamicTransformerBlock,
    ThermodynamicTransformerConfig,
    ThermodynamicMLP,
    ThermodynamicTransformerLayer,
    fit_transformer_end_to_end_cold,
    fit_transformer_end_to_end_parallel_tempering,
    fit_transformer_readout_parallel_tempering,
    fit_transformer_readout_ridge,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# PMODE: programmable Gaussian sampler with a 100 ns relaxation timescale.
pmode = PMODE(tau0=100e-9).to(device)
mu = torch.zeros(4096, device=device)
sigma = torch.ones(4096, device=device) * 0.75
samples = pmode.sample(mu, sigma, n_samples=64, t_total=1e-6)

# PMOG: three-component mixture-of-Gaussians sampler.
pmog = PMOG(n_components=3, tau0=100e-9).to(device)
mix_samples, modes = pmog.sample(
    logits=torch.tensor([0.1, 1.2, -0.3], device=device),
    means=torch.tensor([-2.0, 0.25, 2.25], device=device),
    scales=torch.tensor([0.35, 0.25, 0.45], device=device),
    n_samples=2048,
    t_total=1e-6,
)

# Fixed-time stochastic MLP; info reports modeled physical time.
model = ThermodynamicMLP([1, 32, 1], t_f=1.0, dt=0.05).to(device)
x = torch.linspace(-2, 2, 128, device=device).unsqueeze(-1)
y, info = model(x, return_info=True)
print(y.shape)
print(info.physical_time)

# Research transformer layer with sampled PDIT attention.
layer = ThermodynamicTransformerLayer(
    embed_dim=64,
    num_heads=4,
    thermo_hidden_dim=512,
    attention_mode="pdit",
    n_attention_samples=4,
    attention_t_f=0.05,
    t_f=0.4,
    dt=0.04,
    memory_efficient_chunk_size=128,
).to(device)
tokens = torch.randn(8, 32, 64, device=device)
tokens_out, layer_info = layer(tokens, causal=True, return_info=True)
# Per-call override of the chunk size for lower peak memory.
tokens_out_low_mem, _ = layer(tokens, causal=True, chunk_size=64, return_info=True)
print(tokens_out.shape)
print(layer_info.physical_time)

# Readout-only fit: closed-form ridge solve on frozen thermodynamic features.
target_tokens = torch.sin(tokens)
fit = fit_transformer_readout_ridge(layer, tokens, target_tokens, ridge=1e-1)
print(fit.train_mse)

# Readout fit with parallel-tempering feature selection.
pt_fit = fit_transformer_readout_parallel_tempering(
    layer,
    tokens,
    target_tokens,
    ridge=1e-1,
    keep_fraction=0.35,
    n_tempering_replicas=6,
    n_tempering_steps=24,
)
print(pt_fit.selected_features, pt_fit.swap_acceptance)

# End-to-end training with parallel tempering.
e2e_fit = fit_transformer_end_to_end_parallel_tempering(
    layer,
    tokens,
    target_tokens,
    n_tempering_replicas=5,
    n_tempering_steps=40,
    learning_rate=4e-3,
)
print(e2e_fit.final_train_loss)

# End-to-end training on the cold (no-replica) path.
cold_fit = fit_transformer_end_to_end_cold(
    layer,
    tokens,
    target_tokens,
    n_steps=40,
    learning_rate=4e-3,
)
print(cold_fit.final_train_loss)

# Config-driven pre-norm transformer block with a thermodynamic FFN.
block_config = ThermodynamicTransformerConfig(
    embed_dim=64,
    num_heads=4,
    thermo_hidden_dim=512,
    neuron=ThermodynamicNeuronConfig(t_f=0.2, dt=0.04),
    memory_efficient_chunk_size=128,
)
block = ThermodynamicTransformerBlock(block_config).to(device)
block_out, block_info = block(tokens, causal=True, return_info=True)
print(block_info.physical_time)
Package Layout
thermocompute/
  primitives.py          p-bits, PDIT, PMODE, PMOG, Ising energy
  distributions.py       generic torch.distributions wrappers and adapters
  neurons.py             fixed-time thermodynamic neuron layers and MLPs
  cnn.py                 thermodynamic convolution layers and toy CNN classifier
  transformer.py         thermodynamic transformer attention and FFN blocks
  flow_matching.py       conditional flow matching and thermodynamic velocity fields
  gaussian_process.py    exact GP regression and random-feature GP layers
  precision.py           low-precision quantization-aware FFN support
  integration.py         production-shaped FFN/block wrappers and replacement helper
  training.py            cold/PT end-to-end and readout training helpers
  experiments.py         smoke checks, proof-of-concept checks, scaling studies
  benchmarks.py          research-proof width and baseline benchmarks
  memory.py              FFN memory scaling estimators
  metrics.py             result serialization and small metric helpers
  config.py              device and random seed utilities
  cuda_ext.py            optional CUDA extension loader
  cuda/                  experimental custom CUDA kernel source
scripts/
  run_benchmarks.py
  run_stress.py
  run_tests.py
  run_smoke.py
  run_poc.py
  run_flow_matching_experiment.py
  run_cnn_experiment.py
  run_gaussian_process_experiment.py
  run_readout_training_comparison.py
  run_precision_experiments.py
  run_precision_training_comparison.py
  run_experiments.py
tests/
  test_smoke.py
  test_poc.py
Core Concepts
Physical Time
The library tracks modeled physical time separately from software runtime.
- tau0 is the relaxation timescale for primitive samplers such as PMODE.
- dt is the numerical integration step used by the emulator.
- t_f is the fixed observation time for thermodynamic neurons.
For a fixed-depth thermodynamic MLP, modeled physical time is the sum of the fixed windows for each sequential layer. It does not grow with layer width, batch size, or number of parallel replicas in the hardware model. In PyTorch, wall time still depends on tensor sizes, memory bandwidth, and kernel overhead.
This is the central reason the package exists: it gives a concrete software interface for variable-width neural inference in constant modeled physical time. The width dimension is treated as parallel physical fabric, not sequential digital work.
For a single thermodynamic layer, the modeled latency is:
T_layer = n_steps * dt ~= t_f
and for a fixed-depth unpipelined stack:
T_model = sum_l T_layer_l
There is no width term in that physical-time expression. Width still matters: it increases parameter count, memory in the emulator, hardware area in the substrate model, and total power. The research advantage is specifically that width does not increase the modeled forward-pass latency when all units evolve in parallel for the same fixed observation window.
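As a worked instance of the two expressions above, the following sketch computes modeled physical time for a stack from per-layer `t_f` and `dt` values. It is plain arithmetic, not the package's own bookkeeping: the helper names, the tuple layout, and the rounding rule for `n_steps` are assumptions made for illustration.

```python
def layer_physical_time(t_f: float, dt: float) -> float:
    """Modeled latency of one layer: n_steps * dt, which is approximately t_f."""
    n_steps = max(1, round(t_f / dt))  # assumed rounding rule
    return n_steps * dt


def model_physical_time(layers) -> float:
    """Sum of the fixed windows of an unpipelined stack.

    `layers` is a list of (width, t_f, dt) tuples. Width is carried along
    only to show that it never enters the result.
    """
    return sum(layer_physical_time(t_f, dt) for _width, t_f, dt in layers)


narrow = [(32, 0.4, 0.04), (32, 0.4, 0.04)]
wide = [(4096, 0.4, 0.04), (4096, 0.4, 0.04)]

print(model_physical_time(narrow))
print(model_physical_time(wide))
```

Both stacks report the same modeled time because only depth and the per-layer windows appear in the sum.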
Memory-Efficient Inference
Very wide thermodynamic layers are memory-bound before they are conceptually latency-bound. A dense classical FFN and a thermodynamic FFN both carry width-linear parameter memory:
M_params ~= q * H * (D + E)
where q is bytes per scalar, H is hidden thermodynamic width, D is input
dimension, and E is output dimension. The thermodynamic emulator also needs
state memory:
M_state ~= q * (batch * seq) * H * (replicas + overhead)
For inference, thermocompute now supports chunked projected evaluation. The
full width still exists in the weights, but the emulator only materializes a
slice of hidden thermodynamic state at a time:
M_state_chunked ~= q * (batch * seq) * C * (replicas + overhead)
where C = memory_efficient_chunk_size. The readout is accumulated chunk by
chunk:
y = bias + sum_i W_out[:, i:i+C] * tanh(thermo_chunk_i(x))
This keeps peak activation/state memory proportional to C instead of H.
It does not reduce parameter memory, and it can increase software wall time
because chunks run sequentially in the emulator. The modeled hardware physical
time remains t_f, because the chunks represent parallel physical units on
the target substrate.
Enable it at construction:
layer = ThermodynamicTransformerLayer(
    embed_dim=4096,
    num_heads=32,
    thermo_hidden_dim=500_000,
    t_f=0.2,
    dt=0.04,
    memory_efficient_chunk_size=8192,
).to(device)
or override per call:
y, info = layer(tokens, chunk_size=4096, return_info=True)
The same option is available on ThermodynamicFFN and
ThermodynamicTransformerConfig.
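The chunked accumulation `y = bias + sum_i W_out[:, i:i+C] * tanh(thermo_chunk_i(x))` can be sketched in plain Python for a scalar input and scalar output. This is a minimal illustration, not the package's implementation: `hidden_state` is a deterministic stand-in for the thermodynamic dynamics, and all function names here are hypothetical.

```python
import math
import random


def hidden_state(x, w_in_chunk):
    """Deterministic stand-in for thermo_chunk_i(x): one state per hidden unit."""
    return [w * x for w in w_in_chunk]


def readout_full(x, w_in, w_out, bias):
    """Materialize all H hidden states at once, then read out."""
    states = hidden_state(x, w_in)
    return bias + sum(wo * math.tanh(s) for wo, s in zip(w_out, states))


def readout_chunked(x, w_in, w_out, bias, chunk_size):
    """Accumulate the readout while keeping at most chunk_size states alive."""
    y = bias
    for start in range(0, len(w_in), chunk_size):
        stop = start + chunk_size
        states = hidden_state(x, w_in[start:stop])
        y += sum(wo * math.tanh(s) for wo, s in zip(w_out[start:stop], states))
    return y


rng = random.Random(0)
H = 1000
w_in = [rng.gauss(0.0, 1.0) for _ in range(H)]
w_out = [rng.gauss(0.0, 1.0 / H) for _ in range(H)]

y_full = readout_full(0.7, w_in, w_out, bias=0.1)
y_chunked = readout_chunked(0.7, w_in, w_out, bias=0.1, chunk_size=64)
print(abs(y_full - y_chunked))
```

With deterministic features the two results agree up to floating-point summation order. With stochastic dynamics, agreement would hold in distribution rather than sample by sample unless the random stream is aligned across chunks.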
For quick planning, the package includes first-order memory estimators:
from thermocompute import estimate_classical_ffn_memory, estimate_thermo_ffn_memory

classical = estimate_classical_ffn_memory(
    input_dim=4096,
    hidden_dim=500_000,
    batch_tokens=2048,
    dtype_bytes=2,
)
thermo = estimate_thermo_ffn_memory(
    input_dim=4096,
    hidden_dim=500_000,
    batch_tokens=2048,
    dtype_bytes=2,
    replicas=1,
    chunk_size=8192,
)
print(classical.peak_bytes / 1e9)
print(thermo.peak_bytes / 1e9)
Best-Case Memory Scaling: No Replicas
For production GPU use, the best case is the no-replica thermodynamic FFN:
R = 1
tempering = False
memory_efficient_chunk_size = C
Compare a classical dense FFN and a no-replica thermodynamic FFN with:
D = input/embed dimension
E = output/embed dimension
H = hidden thermodynamic width
N = batch * sequence length
q = bytes per scalar
C = chunk size
Classical dense FFN:
params_classical ~= H(D + E)
activation_classical_peak ~= N H
M_classical_peak ~= q [ H(D + E) + N H + N E ]
No-replica thermodynamic FFN, unchunked:
params_thermo ~= H(D + E + 4)
state_thermo_peak ~= N H (1 + overhead)
M_thermo_peak ~= q [ H(D + E + 4) + N H(1 + overhead) + N E ]
No-replica thermodynamic FFN, chunked:
params_thermo ~= H(D + E + 4)
state_thermo_chunked_peak ~= N C (1 + overhead)
M_thermo_chunked_peak ~= q [ H(D + E + 4) + N C(1 + overhead) + N E ]
So the best-case no-replica memory result is:
parameter memory: O(H(D + E)), linear in H and essentially comparable to classical dense FFN
state memory: O(NC), independent of full width H when chunked
replica overhead: none
This does not make thermodynamic layers parameter-light. The current dense
readout still stores D -> H current weights and H -> E readout weights.
The win is that peak stochastic state/activation memory can be capped by C
instead of H, and the modeled physical-time claim remains tied to the full
parallel array.
For classical dense FFNs, the arithmetic cost still scales with H:
F_classical ~= 2N H(D + E)
For the thermodynamic hardware model, width maps to physical sites:
T_thermo_physical ~= t_f
For the PyTorch emulator, chunking trades memory for software time:
num_chunks = ceil(H / C)
Use chunking when VRAM is the bottleneck. Use larger chunks when wall-clock latency is the bottleneck and VRAM is available.
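To make the trade-off concrete, the first-order formulas in this section can be evaluated directly. The sketch below restates them as plain functions; it is independent of the package's `estimate_*` helpers, the function names are hypothetical, and the `overhead` value is an arbitrary placeholder rather than a measured number.

```python
import math


def classical_peak_bytes(D, E, H, N, q):
    """M_classical_peak ~= q [ H(D + E) + N H + N E ]."""
    return q * (H * (D + E) + N * H + N * E)


def thermo_peak_bytes(D, E, H, N, q, overhead, chunk_size=None):
    """No-replica thermodynamic FFN peak memory, optionally chunked.

    Unchunked: q [ H(D + E + 4) + N H (1 + overhead) + N E ]
    Chunked:   q [ H(D + E + 4) + N C (1 + overhead) + N E ]
    """
    C = H if chunk_size is None else min(chunk_size, H)
    return q * (H * (D + E + 4) + N * C * (1 + overhead) + N * E)


D = E = 4096
H, N, q = 500_000, 2048, 2
overhead = 2.0  # hypothetical value, chosen only for illustration

print(classical_peak_bytes(D, E, H, N, q) / 1e9)
print(thermo_peak_bytes(D, E, H, N, q, overhead) / 1e9)
print(thermo_peak_bytes(D, E, H, N, q, overhead, chunk_size=8192) / 1e9)
print(math.ceil(H / 8192))  # num_chunks the emulator would run sequentially
```

Parameter memory is identical in the chunked and unchunked thermodynamic cases; only the state term shrinks from N H to N C.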
Cold no-replica training also works with memory_efficient_chunk_size, but
training memory is harder than inference memory because autograd may retain
chunk graphs and integration intermediates for backward. Treat chunked cold
training as a compatibility path today; the strongest memory guarantee is for
inference. Future checkpointing or custom backward kernels should improve the
training side.
Low-Precision Thermodynamic FFNs
The package now includes a quantization-aware low-precision path for thermodynamic FFNs. It supports the common ML precision ladder:
fp64, fp32, tf32-style, fp16, bf16, fp8_e4m3fn, fp8_e5m2,
int8, int4, int2, binary
Use QuantizedThermodynamicFFN when you want the forward pass to experience
low-precision weights, currents, states, and readout activations while keeping
trainable master parameters in ordinary floating point:
from thermocompute import (
    QuantizationConfig,
    QuantizedThermodynamicFFN,
    ThermodynamicNeuronConfig,
    fit_quantized_ffn_mse,
)

model = QuantizedThermodynamicFFN(
    embed_dim=256,
    hidden_dim=4096,
    quantization=QuantizationConfig(format="int4", per_channel=True),
    neuron_config=ThermodynamicNeuronConfig(t_f=0.08, dt=0.04, temperature=0.0),
    memory_efficient_chunk_size=256,
)
# tokens and targets are user-supplied tensors whose last dimension matches embed_dim.
y = model(tokens)
result = fit_quantized_ffn_mse(model, tokens, targets, n_steps=32)
The low-bit integer formats use symmetric fake quantization plus a straight-through estimator, so training remains differentiable:
forward: quantized/dequantized int4 or int8 values
backward: gradient passes through the quantizer to the master parameters
This is intentionally framed as quantization-aware emulation, not a claim that
every backend has native int4 training kernels. See
docs/precision_scheme.md for the exact method and
scripts/run_precision_experiments.py for small low-memory experiments.
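The forward and backward split described above can be illustrated without any framework. The sketch below is an assumption-laden toy, not the method in docs/precision_scheme.md: it uses a single per-tensor scale (the example above uses `per_channel=True`), and `ste_grad` only spells out what the straight-through estimator means for gradients.

```python
def fake_quantize(weights, bits=4):
    """Symmetric fake quantization: map to signed integer codes, then dequantize.

    Returns (dequantized_values, integer_codes, scale).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for int4, 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes], codes, scale


def ste_grad(upstream_grads):
    """Straight-through estimator: the quantizer acts as identity in backward."""
    return list(upstream_grads)


master = [0.82, -0.31, 0.05, -0.77, 0.0]  # full-precision master parameters
dequant, codes, scale = fake_quantize(master, bits=4)

print(codes)
print(dequant)
print(ste_grad([0.1, -0.2, 0.3, 0.0, 0.5]))
```

The forward pass sees only the dequantized values, while the gradient list is handed back unchanged to the master parameters.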
For a direct comparison against standard and mixed-precision training:
python scripts/run_precision_training_comparison.py --device cpu --steps 20 --repeats 3 --outdir artifacts
The current tiny CPU comparison shows fp16/bf16/fp8/int8/int4 training tracking standard fp32 closely on final eval loss, while int2 and binary remain usable but less accurate. Treat this as a smoke-scale feasibility result, not a claim of production low-bit training superiority.
Current checked-in precision comparison:
| Method | Mean Final Eval Loss | Ratio vs fp32 |
|---|---|---|
| standard fp32 | 0.201735 | 1.0000 |
| mixed autocast bf16 | 0.201726 | 0.99995 |
| mixed autocast fp16 | 0.201736 | 1.00000 |
| quantized fp16 | 0.201735 | 1.00000 |
| quantized bf16 | 0.201711 | 0.99988 |
| quantized fp8 e4m3fn | 0.201785 | 1.00024 |
| quantized fp8 e5m2 | 0.202106 | 1.00184 |
| quantized int8 | 0.201762 | 1.00013 |
| quantized int4 | 0.200814 | 0.99543 |
| quantized int2 | 0.215432 | 1.06789 |
| quantized binary | 0.204009 | 1.01127 |
This table is from artifacts/precision_training_comparison.json.
Flow-Matching Diffusion
Flow matching is a natural generative interface for thermocompute: train a
velocity field that transports simple Gaussian noise into a target
distribution, then sample by integrating a probability-flow ODE. Compared with
long iterative diffusion samplers, the immediate speed lever is fewer neural
function evaluations:
speedup proxy ~= diffusion_steps / flow_steps
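The proxy and the underlying ODE integration can be sketched with a toy velocity field. Everything below is illustrative: `straight_line_velocity` is a hand-written field that transports any point to a fixed `target`, standing in for a learned model, and the function names are not part of the package.

```python
def speedup_proxy(diffusion_steps: int, flow_steps: int) -> float:
    """Ratio of neural function evaluations; ignores per-step cost differences."""
    return diffusion_steps / flow_steps


def straight_line_velocity(x, t, target):
    """Toy velocity field that moves any point toward `target`, arriving at t = 1."""
    return (target - x) / (1.0 - t)


def sample_with_euler(x0, target, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    x, h = x0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * h  # stays below 1.0, so the division above is safe
        x = x + h * straight_line_velocity(x, t, target)
    return x


print(speedup_proxy(50, 8))
print(speedup_proxy(50, 1))
print(sample_with_euler(x0=-1.3, target=2.0, n_steps=1))
print(sample_with_euler(x0=-1.3, target=2.0, n_steps=8))
```

For a perfectly straight path, one Euler step already lands on the target. A learned velocity field is only approximately straight, so fewer steps trade accuracy for evaluations.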
thermocompute includes a classical time-conditioned flow MLP and a
thermodynamic flow velocity model:
from thermocompute import (
    FlowVelocityMLP,
    ThermodynamicFlowVelocity,
    ThermodynamicNeuronConfig,
    fit_flow_matching_end_to_end,
    fit_flow_matching_readout_ridge,
    make_mog2d,
    sample_flow,
)

data = make_mog2d(384)
model = ThermodynamicFlowVelocity(
    data_dim=2,
    embed_dim=16,
    thermo_hidden_dim=48,
    neuron_config=ThermodynamicNeuronConfig(t_f=0.08, dt=0.04, temperature=0.0),
    memory_efficient_chunk_size=16,
)
fit = fit_flow_matching_readout_ridge(model, data, n_pairs=512, ridge=1e-2)
samples = sample_flow(model, 128, n_flow_steps=4)
Run the lightweight CPU experiment:
python scripts/run_flow_matching_experiment.py --device cpu --train-steps 96 --sample-count 128 --outdir artifacts
Current checked-in flow result on an eight-mode 2D Gaussian mixture:
| Model | Best Flow Steps | MMD² To Reference | Wall ms | Eval Speedup vs 50-Step Diffusion |
|---|---|---|---|---|
| classical flow MLP | 8 | 0.051954 | 0.550 | 6.25x |
| thermodynamic flow | 1 | 0.062004 | 0.542 | 50.0x |
The thermodynamic one-step result is the interesting research signal: it is not as accurate as the best 8-step classical flow in this tiny run, but it reaches a comparable distribution score with a single velocity evaluation. This is a toy CPU feasibility check, not a production diffusion benchmark.
Flow matching now has two explicit training modes:
- fit_flow_matching_readout_ridge: freeze the thermodynamic feature fabric and solve the velocity readout in one ridge system.
- fit_flow_matching_end_to_end: ordinary no-ridge AdamW training through the full model.
Thermodynamic CNNs
CNN support covers the local-feature case: every receptive field is converted to a current vector, the hidden thermodynamic channel bank evolves for the same fixed observation window, and a linear readout maps those thermodynamic features into ordinary output channels.
image patch -> current bank -> fixed-time thermodynamic channels -> output channels
The layer keeps the CNN contract:
[batch, channels, height, width] -> [batch, out_channels, out_height, out_width]
A minimal classifier looks like this:
from thermocompute import ThermodynamicCNNClassifier, ThermodynamicNeuronConfig

model = ThermodynamicCNNClassifier(
    in_channels=1,
    n_classes=2,
    conv_channels=8,
    thermo_channels=24,
    neuron_config=ThermodynamicNeuronConfig(t_f=0.08, dt=0.04, temperature=0.0),
    memory_efficient_chunk_size=12,
)
logits, info = model(images, return_info=True)
print(info.physical_time)  # 0.08
Run the lightweight CPU coverage experiment:
python scripts/run_cnn_experiment.py --device cpu --train-steps 80 --outdir artifacts
Current checked-in toy result on an 8x8 vertical-vs-horizontal bar task:
| Model | Final Loss | Final Accuracy | Fit Wall ms | Modeled Physical Time |
|---|---|---|---|---|
| classical tiny CNN | 0.000015 | 1.000 | 92.0 | 0.0 |
| thermodynamic CNN | 0.043337 | 1.000 | 459.4 | 0.08 |
This result is deliberately modest: it proves the package can express, differentiate, chunk, train, and serialize convolutional thermodynamic modules. It is not a production computer-vision benchmark. The important coverage point is that the fixed-time thermodynamic width idea is not restricted to transformers or MLPs; it also maps naturally to local receptive fields.
Fast Readout Training For Flow And CNNs
The package now exposes the same training split for flow matching and CNNs:
fast readout ridge = freeze thermodynamic fabric, solve final readout
end-to-end no ridge = train the whole differentiable model with AdamW
Run the CPU-light comparison:
python scripts/run_readout_training_comparison.py --device cpu --flow-steps 96 --cnn-steps 80 --outdir artifacts
Current checked-in result:
| Task | Method | Final Metric | Fit Wall ms | Notes |
|---|---|---|---|---|
| flow matching | readout ridge | loss 2.6620, one-step MMD² 0.0522 | 2.51 | fast frozen thermodynamic velocity readout |
| flow matching | end-to-end no ridge | loss 2.6482, one-step MMD² 0.0550 | 325.81 | slightly lower supervised loss, much slower CPU training |
| CNN bars | readout ridge | accuracy 1.000, loss 0.6612 | 5.65 | direct ridge from pooled thermodynamic hidden channels |
| CNN bars | end-to-end no ridge | accuracy 1.000, loss 0.0433 | 501.16 | much better confidence, much slower CPU training |
The pattern is exactly what we should expect at this stage. Ridge readout training is dramatically faster because it avoids iterative backpropagation through the thermodynamic dynamics. End-to-end training can improve task metrics because it adapts the whole feature generator, but it pays normal software training costs. The research path is to use ridge/readout alignment wherever a large thermodynamic fabric already provides enough useful features, and reserve end-to-end training for cases where the feature fabric itself must move.
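The reason ridge readout is so much cheaper is that it reduces to one linear solve over frozen features. A minimal sketch, assuming a scalar output and a tiny hand-made feature set in place of a thermodynamic fabric (the helper names are hypothetical and the solver is a textbook Gaussian elimination, not the package's routine):

```python
import math


def solve_linear(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_ridge_readout(features, targets, ridge):
    """Closed-form readout: solve (F^T F + ridge * I) w = F^T y."""
    k = len(features[0])
    gram = [
        [sum(f[i] * f[j] for f in features) + (ridge if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]
    rhs = [sum(f[i] * y for f, y in zip(features, targets)) for i in range(k)]
    return solve_linear(gram, rhs)


# Frozen "fabric": fixed nonlinear features of a scalar input (illustrative only).
xs = [i / 10.0 for i in range(-20, 21)]
features = [[math.tanh(x), math.tanh(2.0 * x - 1.0), 1.0] for x in xs]
targets = [0.5 * f[0] - 1.5 * f[1] + 0.25 for f in features]

weights = fit_ridge_readout(features, targets, ridge=1e-8)
print(weights)
```

No gradient steps are taken and the features are never updated, which is why this path cannot improve the feature generator itself.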
For CNNs, ridge fitting switches the classifier to readout_mode="thermo" so
the readout is solved directly from pooled thermodynamic hidden channels. To
load that state dict later, instantiate the classifier with the same
readout_mode.
Gaussian Processes
Gaussian processes are now a supported probabilistic layer family. The exact path gives small-data posterior inference with mean, covariance, variance, and posterior samples. The scalable layer path uses random Fourier features as an RBF GP approximation and fits the readout with the same fast ridge-alignment idea used elsewhere in the package.
from thermocompute import (
    ExactGaussianProcessRegressor,
    RandomFeatureGaussianProcessLayer,
    RBFKernelConfig,
    fit_gp_readout_ridge,
    make_gp_regression_data,
)

train_x, train_y = make_gp_regression_data(32)
test_x, _ = make_gp_regression_data(96, noise=0.0)

exact = ExactGaussianProcessRegressor(
    kernel=RBFKernelConfig(lengthscale=0.75, output_scale=1.0),
    noise=0.04**2,
)
exact.fit(train_x, train_y)
posterior = exact.predict(test_x)

layer = RandomFeatureGaussianProcessLayer(
    in_features=1,
    out_features=1,
    n_random_features=128,
    kernel=RBFKernelConfig(lengthscale=0.75, output_scale=1.0),
)
fit_gp_readout_ridge(layer, train_x, train_y, ridge=1e-3)
prediction = layer(test_x)
Run the CPU-light GP experiment:
python scripts/run_gaussian_process_experiment.py --device cpu --train-samples 32 --test-samples 96 --random-features 128 --outdir artifacts
Current checked-in result:
| Model | Test RMSE | Fit Wall ms | Predict Wall ms | Notes |
|---|---|---|---|---|
| exact GP | 0.0248 | 11.67 | 1.23 | exact posterior covariance and samples |
| random-feature GP layer | 0.0536 | 0.61 | 0.05 | fast ridge readout, scalable layer form |
Exact GP inference is a reference path and scales cubically with training points. The random-feature GP layer is the scalable package path.
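The random-feature path rests on the standard random Fourier feature approximation of an RBF kernel. The sketch below shows that approximation for one-dimensional inputs with unit output scale. It is a generic construction under those assumptions, not the package's layer, and the function names are hypothetical.

```python
import math
import random


def make_rff(n_features, lengthscale, seed=0):
    """Draw frequencies and phases for random Fourier features of a 1-D RBF kernel."""
    rng = random.Random(seed)
    freqs = [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return freqs, phases


def rff_map(x, freqs, phases):
    """Feature map z(x) with z(x) . z(x') ~= exp(-(x - x')^2 / (2 * lengthscale^2))."""
    norm = math.sqrt(2.0 / len(freqs))
    return [norm * math.cos(w * x + b) for w, b in zip(freqs, phases)]


def rbf_kernel(x1, x2, lengthscale):
    """Exact RBF kernel with unit output scale."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))


freqs, phases = make_rff(n_features=4000, lengthscale=0.75)
z1, z2 = rff_map(0.2, freqs, phases), rff_map(0.9, freqs, phases)
approx = sum(a * b for a, b in zip(z1, z2))
exact = rbf_kernel(0.2, 0.9, lengthscale=0.75)
print(approx, exact)
```

Because the kernel becomes an inner product of finite feature vectors, fitting reduces to a ridge solve whose cost depends on the number of features rather than cubically on the number of training points.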
BinaryPBit
BinaryPBit is a Bernoulli sampler controlled by a voltage-like input:
from thermocompute import BinaryPBit
pbit = BinaryPBit(beta=2.0)
probabilities = pbit.probabilities(control_voltage)
states = pbit.sample(control_voltage)
It also supports a vectorized Ising-style Gibbs step when paired with
IsingEnergy. States are represented as 0/1, while the energy internally maps
them to -1/+1.
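A voltage-controlled Bernoulli unit can be sketched in a few lines. The response curve below, P(state = 1) = sigmoid(beta * control_voltage), is one common p-bit convention and is an assumption here; the package's exact scaling may differ, and the function names are illustrative.

```python
import math
import random


def pbit_probability(control_voltage: float, beta: float) -> float:
    """Assumed response curve: P(state = 1) = sigmoid(beta * control_voltage)."""
    return 1.0 / (1.0 + math.exp(-beta * control_voltage))


def pbit_sample(control_voltage: float, beta: float, rng: random.Random) -> int:
    """Draw one 0/1 state."""
    return 1 if rng.random() < pbit_probability(control_voltage, beta) else 0


def to_spin(state: int) -> int:
    """Map the 0/1 state convention to the -1/+1 spin convention."""
    return 2 * state - 1


rng = random.Random(0)
draws = [pbit_sample(0.5, beta=2.0, rng=rng) for _ in range(20000)]
print(sum(draws) / len(draws), pbit_probability(0.5, beta=2.0))
print(to_spin(0), to_spin(1))
```

Larger `beta` sharpens the curve toward a deterministic threshold, while `beta` near zero makes the unit a fair coin regardless of voltage.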
CategoricalPDIT
CategoricalPDIT generalizes the binary p-bit to a categorical sampler:
from thermocompute import CategoricalPDIT
pdit = CategoricalPDIT(beta=1.0)
index = pdit.sample(logits)
This emulates a programmable k-state thermodynamic sampling unit where logits or biases define the categorical distribution.
PMODE
PMODE is a programmable Gaussian sampler. It uses an exact
Ornstein-Uhlenbeck transition rather than a fragile tiny-step approximation:
from thermocompute import PMODE
pmode = PMODE(tau0=100e-9).to(device)
samples = pmode.sample(mu, sigma, n_samples=4096, t_total=1e-6)
Each returned sample represents an independent device evolving through the same
physical window. For diagnostics, PMODE.trajectory(...) returns the time trace
of a single emulated device.
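The exact Ornstein-Uhlenbeck transition has a closed form, which is why no small time step is needed. A minimal sketch, assuming the stationary distribution is N(mu, sigma^2) and `tau0` is the relaxation time (the function name and initial state are illustrative, not the package's internals):

```python
import math
import random
import statistics


def ou_exact_step(x0, mu, sigma, tau0, t, rng):
    """Exact Ornstein-Uhlenbeck transition over an interval of length t.

    Assumes a stationary distribution N(mu, sigma^2) and relaxation time tau0.
    """
    decay = math.exp(-t / tau0)
    mean = mu + (x0 - mu) * decay
    std = sigma * math.sqrt(1.0 - decay * decay)
    return mean + std * rng.gauss(0.0, 1.0)


rng = random.Random(0)
tau0, t_total = 100e-9, 1e-6  # the window spans ten relaxation times
samples = [ou_exact_step(5.0, 0.0, 0.75, tau0, t_total, rng) for _ in range(20000)]

print(statistics.fmean(samples), statistics.pstdev(samples))
```

After ten relaxation times the memory of the starting point has decayed by a factor of about e^-10, so the samples are effectively draws from the programmed Gaussian.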
PMOG
PMOG is a probabilistic mixture-of-Gaussians sampler. It couples a categorical
mode selector to PMODE-style Gaussian relaxation:
from thermocompute import PMOG
pmog = PMOG(n_components=4).to(device)
samples, modes = pmog.sample(logits, means, scales, n_samples=20000)
The categorical logits set the mode probabilities, while means and scales
set the local Gaussian parameters. switch_rate can be used to emulate mode
resampling during the physical window.
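Ignoring relaxation dynamics and `switch_rate`, the sampling structure reduces to picking a mode from the softmax of the logits and then drawing from that mode's Gaussian. The sketch below shows only that static structure; the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mixture(logits, means, scales, n_samples, rng):
    """Pick a mode per sample, then draw from that mode's Gaussian."""
    probs = softmax(logits)
    modes = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    samples = [rng.gauss(means[m], scales[m]) for m in modes]
    return samples, modes


rng = random.Random(0)
logits, means, scales = [0.1, 1.2, -0.3], [-2.0, 0.25, 2.25], [0.35, 0.25, 0.45]
samples, modes = sample_mixture(logits, means, scales, 20000, rng)

fractions = [modes.count(k) / len(modes) for k in range(3)]
print(fractions)
print(softmax(logits))
```

The empirical mode fractions should track the softmax probabilities, which is the sense in which the logits set the mode probabilities.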
Generic Probability Distributions
The package does not try to reimplement every named probability law from
scratch. Instead, it exposes a generic distribution layer over
torch.distributions, plus an adapter for custom user-defined distributions.
That gives broad practical coverage: every distribution available in the
installed PyTorch build can be constructed, sampled, and scored through the
same thermocompute API.
import torch
from thermocompute import (
    DistributionAdapter,
    DistributionSampler,
    DistributionSpec,
    available_distributions,
    make_distribution,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(available_distributions())

normal = DistributionSampler(
    "normal",
    loc=torch.zeros(4096, device=device),
    scale=torch.ones(4096, device=device),
).to(device)
samples = normal.sample(128)
logp = normal.log_prob(samples)

beta = make_distribution(
    "beta",
    concentration1=torch.ones(32, device=device) * 2,
    concentration0=torch.ones(32, device=device) * 3,
)

spec = DistributionSpec("poisson", {"rate": torch.ones(16, device=device) * 4})
poisson = spec.build()

custom = DistributionAdapter(
    torch.distributions.TransformedDistribution(
        torch.distributions.Normal(torch.zeros(8, device=device), torch.ones(8, device=device)),
        [torch.distributions.transforms.ExpTransform()],
    )
)
custom_samples = custom.sample(64)
This is the maintainable version of “support every distribution”: the package
supports the full PyTorch distribution ecosystem and any custom distribution
object that follows the torch.distributions.Distribution interface. Generic
distribution sampling uses PyTorch’s RNG behavior; use set_seed(...) for
reproducibility.
ThermodynamicNeuronLayer
ThermodynamicNeuronLayer computes a digital input current and then evolves a
quartic stochastic dynamical system for a fixed observation time:
from thermocompute import ThermodynamicNeuronLayer
layer = ThermodynamicNeuronLayer(
in_features=16,
out_features=64,
j2=1.0,
j3=0.0,
j4=1.5,
temperature=1.0,
t_f=0.5,
dt=0.025,
).to(device)
y, info = layer(x, return_info=True)
The potential is:
V(x) = 0.5 * J2 * x^2 + (J3 / 3) * x^3 + 0.25 * J4 * x^4 - I * x
The Euler-Maruyama update emulates overdamped Langevin dynamics under that potential. The layer supports multiple replicas and parallel tempering. The implementation also applies state and force rails, which are a practical emulator equivalent of bounded circuit voltages and prevent quartic blow-ups in explicit integration.
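The update described above can be sketched for a single scalar neuron in pure Python. This is an illustration under stated assumptions, not the layer's code: unit mobility, a noise term of sqrt(2 * temperature * dt), and the rail values 5.0 and 50.0 are all choices made for this example, and `evolve_neuron` is a hypothetical helper.

```python
import math
import random

def quartic_force(x, current, j2=1.0, j3=0.0, j4=1.5):
    """Force -dV/dx for V(x) = 0.5*J2*x^2 + (J3/3)*x^3 + 0.25*J4*x^4 - I*x."""
    return -(j2 * x + j3 * x * x + j4 * x ** 3) + current

def clamp(value, limit):
    """Symmetric rail: keep value inside [-limit, +limit]."""
    return max(-limit, min(limit, value))

def evolve_neuron(current, t_f=0.5, dt=0.025, temperature=1.0,
                  state_rail=5.0, force_rail=50.0, x0=0.0, rng=random):
    """Euler-Maruyama integration of overdamped Langevin dynamics for time t_f."""
    n_steps = round(t_f / dt)          # fixed by the window, not by layer width
    noise_std = math.sqrt(2.0 * temperature * dt)
    x = x0
    for _ in range(n_steps):
        force = clamp(quartic_force(x, current), force_rail)   # force rail
        x = x + force * dt + noise_std * rng.gauss(0.0, 1.0)
        x = clamp(x, state_rail)                                # state rail
    return x

rng = random.Random(0)
outputs = [evolve_neuron(current=1.0, rng=rng) for _ in range(64)]
```

Every entry of `outputs` uses the same number of integration steps, which is the emulator-side counterpart of all neurons sharing one observation window. The rails matter most for large input currents, where an unclamped explicit step on the cubic force term could overshoot.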
ThermodynamicMLP
ThermodynamicMLP stacks thermodynamic neuron layers:
from thermocompute import ThermodynamicMLP
model = ThermodynamicMLP(
[16, 128, 128, 8],
t_f=0.4,
dt=0.04,
).to(device)
y, info = model(x, return_info=True)
print(info.physical_time)
The current implementation uses digital matrix multiplication to compute input currents, then uses stochastic thermodynamic dynamics as the nonlinear sampling core. This is the right split for a software emulator: it gives the same algorithmic surface while keeping the analog part explicit.
ThermodynamicTransformerLayer
ThermodynamicTransformerLayer formulates a transformer block whose
feed-forward width is a parallel thermodynamic array. The block is:
x1 = x + attention(norm1(x))
x2 = x1 + out_proj(tanh(thermo_ff(norm2(x1))))
The thermodynamic feed-forward core maps each token to input currents for a
variable-width array of quartic neurons. Every neuron in that array evolves for
the same fixed window t_f. Therefore, increasing thermo_hidden_dim changes
emulator tensor size and hardware area, but not modeled physical time.
from thermocompute import ThermodynamicTransformerLayer
layer = ThermodynamicTransformerLayer(
embed_dim=128,
num_heads=8,
thermo_hidden_dim=2048,
attention_mode="softmax",
t_f=0.4,
dt=0.04,
memory_efficient_chunk_size=512,
).to(device)
y, info = layer(tokens, return_info=True)
print(info.feedforward_physical_time)
Attention has two modes:
- attention_mode="softmax" uses ordinary differentiable scaled dot-product attention. It is digital token mixing and contributes no thermodynamic physical-time window.
- attention_mode="pdit" treats each query as a categorical PDIT over keys, samples values, and averages them. This is a stochastic sampled-attention emulator. Set attention_t_f to count a fixed attention sampling window.
The transformer layer is the most direct expression of the package thesis: variable-width neural networks can run inference in constant modeled physical time when width is implemented as parallel thermodynamic fabric.
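The idea behind sampled attention can be shown with a small Monte Carlo sketch: if key indices are drawn with softmax probabilities and the picked values are averaged, the estimate approaches ordinary softmax attention as the sample count grows. The helpers below are hypothetical and handle a single query in pure Python; the package's pdit mode may differ in details such as temperature or batching.

```python
import math
import random

def attention_probs(query, keys):
    """Softmax over scaled dot-product scores for one query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_attention(query, keys, values):
    """Exact expectation: probability-weighted average of the values."""
    probs = attention_probs(query, keys)
    return [sum(p * v[d] for p, v in zip(probs, values))
            for d in range(len(values[0]))]

def sampled_attention(query, keys, values, n_samples, rng):
    """Monte Carlo estimate: sample key indices, then average the picked values."""
    probs = attention_probs(query, keys)
    picks = rng.choices(range(len(keys)), weights=probs, k=n_samples)
    return [sum(values[i][d] for i in picks) / n_samples
            for d in range(len(values[0]))]

rng = random.Random(0)
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
exact = softmax_attention(query, keys, values)
estimate = sampled_attention(query, keys, values, n_samples=20000, rng=rng)
max_error = max(abs(a - b) for a, b in zip(exact, estimate))
```

The error of such an estimator shrinks roughly as one over the square root of the number of samples, so the sample count trades accuracy against sampling cost.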
Production-Shaped PyTorch Blocks
For integration with ordinary PyTorch models, use ThermodynamicFFN,
ThermodynamicTransformerBlock, and replace_ffn.
from torch import nn
from thermocompute import (
ThermodynamicNeuronConfig,
ThermodynamicTransformerBlock,
ThermodynamicTransformerConfig,
replace_ffn,
)
config = ThermodynamicTransformerConfig(
embed_dim=128,
num_heads=8,
thermo_hidden_dim=2048,
neuron=ThermodynamicNeuronConfig(t_f=0.2, dt=0.04, n_replicas=2, output="mean"),
memory_efficient_chunk_size=512,
)
block = ThermodynamicTransformerBlock(config).to(device)
out, info = block(tokens, causal=True, return_info=True)
out_low_mem, _ = block(tokens, causal=True, chunk_size=256, return_info=True)
model = nn.Module()
model.ffn = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128))
replaced = replace_ffn(model, lambda name, module: name == "ffn", config)
ThermodynamicFFN has the same tensor contract as a transformer FFN:
[batch, seq, embed] -> [batch, seq, embed]. It is intentionally residual-free
so it can be dropped into blocks that already own their residual connection.
ThermodynamicTransformerBlock is a complete pre-norm block with attention,
residuals, and a thermodynamic FFN. Both modules support .to(device, dtype),
state_dict save/load, autograd, train/eval mode, and return_info=True.
For very wide layers, set memory_efficient_chunk_size in the config or pass
chunk_size during forward. This keeps peak inference state memory bounded
by the chunk size while preserving the public shape contract.
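The memory argument can be illustrated with a toy feed-forward pass that processes the hidden width in slices. This sketch assumes the chunking runs over the hidden width and uses a deterministic tanh as a stand-in for the stochastic neuron dynamics; `ffn_full` and `ffn_chunked` are hypothetical helpers, not package functions.

```python
import math
import random

def ffn_full(x, in_weights, out_weights):
    """Reference pass: materialize the whole hidden vector at once."""
    hidden = [math.tanh(w * x) for w in in_weights]
    return sum(h * o for h, o in zip(hidden, out_weights))

def ffn_chunked(x, in_weights, out_weights, chunk_size):
    """Same result, but at most chunk_size hidden values exist at any time."""
    total = 0.0
    for start in range(0, len(in_weights), chunk_size):
        stop = start + chunk_size
        hidden = [math.tanh(w * x) for w in in_weights[start:stop]]
        total += sum(h * o for h, o in zip(hidden, out_weights[start:stop]))
    return total

rng = random.Random(0)
width = 1000
in_w = [rng.gauss(0.0, 1.0) for _ in range(width)]
out_w = [rng.gauss(0.0, 1.0) / width for _ in range(width)]
full = ffn_full(0.7, in_w, out_w)
chunked = ffn_chunked(0.7, in_w, out_w, chunk_size=256)
```

Because the output projection is a sum over hidden units, partial sums per chunk can be accumulated without changing the output shape. With stochastic neurons, chunked and unchunked runs would presumably agree in distribution rather than bit for bit.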
replace_ffn is deliberately conservative. It only replaces plain
nn.Sequential MLPs whose first linear input size and last linear output size
match config.embed_dim. When it replaces a module, it preserves the target
module’s device, floating dtype, and train/eval mode. That makes it practical
for real PyTorch experiments without pretending to cover every architecture.
This is production-shaped PyTorch support, not a full Hugging Face integration. The core package has no hard Hugging Face dependency and no hardware driver.
Thermodynamic Readout Alignment
fit_transformer_readout_ridge implements Thermodynamic Readout Alignment
(TRA), a fast training analogue for the fixed-time inference model.
TRA treats the thermodynamic transformer feed-forward block as a physical reservoir:
- Run the token stream through attention and the thermodynamic neuron array.
- Collect fixed-time stochastic features from the thermodynamic array.
- Fit only the final projection with one closed-form ridge regression solve.
- Keep inference exactly the same: one fixed thermodynamic evolution window, followed by the learned readout.
from thermocompute import ThermodynamicTransformerLayer, fit_transformer_readout_ridge
layer = ThermodynamicTransformerLayer(
embed_dim=64,
num_heads=4,
thermo_hidden_dim=1024,
t_f=0.2,
dt=0.04,
memory_efficient_chunk_size=256,
).to(device)
tokens = torch.randn(128, 16, 64, device=device)
targets = torch.sin(tokens + 0.25 * torch.roll(tokens, shifts=1, dims=1))
fit = fit_transformer_readout_ridge(
layer,
tokens,
targets,
ridge=1e-1,
feature_repeats=2,
)
print(fit.train_mse)
print(fit.fit_wall_ms)
print(fit.physical_time)
This is not full end-to-end transformer training. It is closer to reservoir computing or extreme learning machines, but with the reservoir generated by fixed-time thermodynamic dynamics. It is valuable because it gives a training path whose expensive nonlinear feature generation has the same parallel physical-time structure as inference. The digital work is the ridge solve and readout programming.
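The closed-form readout fit at the heart of this approach can be sketched without any dependencies. The example below freezes a small random tanh feature map as a stand-in for the thermodynamic reservoir and solves w = (F^T F + ridge * I)^-1 F^T y for a scalar target. All helper names are hypothetical, and the real function works on transformer features rather than this toy.

```python
import math
import random

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def ridge_readout(features, targets, ridge):
    """Closed-form ridge weights for a single output channel."""
    d = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(f[i] * y for f, y in zip(features, targets)) for i in range(d)]
    return solve_linear(gram, rhs)

# Frozen random features play the role of the reservoir; only the readout is fit.
rng = random.Random(0)
proj = [rng.gauss(0.0, 1.0) for _ in range(8)]
inputs = [rng.uniform(-2.0, 2.0) for _ in range(64)]
features = [[math.tanh(p * x) for p in proj] for x in inputs]
targets = [math.sin(x) for x in inputs]
weights = ridge_readout(features, targets, ridge=1e-1)
preds = [sum(w * f for w, f in zip(weights, row)) for row in features]
train_mse = sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)
```

The ridge term keeps the normal equations well conditioned when features are nearly collinear, which is likely for wide reservoirs.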
Cold End-To-End Training
fit_transformer_end_to_end_cold exposes ordinary single-replica end-to-end
training. This is the default no-ridge training path: no replica ladder, no
tempering swaps, and the lowest memory footprint among the end-to-end methods.
It is the right first choice for smooth objectives, large widths, and GPU-only
experiments.
from thermocompute import ThermodynamicTransformerLayer, fit_transformer_end_to_end_cold
layer = ThermodynamicTransformerLayer(
embed_dim=32,
num_heads=4,
thermo_hidden_dim=128,
t_f=0.2,
dt=0.04,
memory_efficient_chunk_size=64,
).to(device)
fit = fit_transformer_end_to_end_cold(
layer,
train_tokens,
train_targets,
eval_inputs=eval_tokens,
eval_targets=eval_targets,
n_steps=40,
learning_rate=4e-3,
)
print(fit.final_train_loss)
print(fit.final_eval_loss)
print(fit.fit_wall_ms)
Use cold training when:
- the objective is smooth or easy
- you need the cheapest no-ridge baseline
- memory is tight
- training time matters more than exploration
- you have not yet shown that replica ladders help your scale/task
In the current toy inductive transformer experiment, cold training performs surprisingly well: it reaches nearly the same train and eval loss as the parallel-tempered version while using one model replica instead of five. The PT version gives a small eval improvement, but costs more wall time and memory.
Parallel Tempered End-To-End Training
fit_transformer_end_to_end_parallel_tempering is an optional exploration
method: no ridge solve, no closed-form readout fitting, and no frozen reservoir
assumption, but it keeps several full model copies. Use it only after a cold
single-replica baseline gets stuck or when there is evidence the task has a
rugged loss landscape.
The method keeps several full copies of the transformer layer:
- Each replica trains all differentiable parameters by gradient descent.
- Hot replicas receive more Langevin parameter noise.
- Cold replicas exploit low-loss regions.
- Adjacent replicas periodically swap whole parameter states with a Metropolis parallel-tempering rule.
- The best final replica is loaded back into the original layer.
from thermocompute import (
ThermodynamicTransformerLayer,
fit_transformer_end_to_end_parallel_tempering,
)
layer = ThermodynamicTransformerLayer(
embed_dim=32,
num_heads=4,
thermo_hidden_dim=128,
t_f=0.2,
dt=0.04,
memory_efficient_chunk_size=64,
).to(device)
fit = fit_transformer_end_to_end_parallel_tempering(
layer,
train_tokens,
train_targets,
eval_inputs=eval_tokens,
eval_targets=eval_targets,
n_tempering_replicas=5,
n_tempering_steps=40,
learning_rate=4e-3,
noise_scale=2e-3,
)
print(fit.initial_train_loss)
print(fit.final_train_loss)
print(fit.final_eval_loss)
print(fit.swap_acceptance)
Use parallel-tempered end-to-end training when:
- losses are rugged or multimodal
- runs get stuck in poor basins
- extra memory for full replicas is acceptable
- a small accuracy gain is worth more training compute
This is not currently the package’s main scaling result. We have not yet shown that parallel tempering becomes more valuable at large thermodynamic width. It is included because it is a plausible exploration tool, not because it beat the no-replica path in the first experiment.
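For readers unfamiliar with the swap step, the standard Metropolis parallel-tempering rule accepts an exchange between replicas i and j with probability min(1, exp((beta_i - beta_j) * (E_i - E_j))). The sketch below applies it with training loss as the energy. Treating loss as energy, the temperature values, and the helper names are assumptions for illustration; the package's rule may be parameterized differently.

```python
import math
import random

def swap_probability(loss_i, loss_j, temp_i, temp_j):
    """Metropolis acceptance probability for exchanging two replica states."""
    beta_i, beta_j = 1.0 / temp_i, 1.0 / temp_j
    log_ratio = (beta_i - beta_j) * (loss_i - loss_j)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)

def maybe_swap(states, losses, temps, i, rng=random):
    """Try to exchange replica i with its neighbor i + 1; return True on success."""
    j = i + 1
    if rng.random() < swap_probability(losses[i], losses[j], temps[i], temps[j]):
        states[i], states[j] = states[j], states[i]
        losses[i], losses[j] = losses[j], losses[i]
        return True
    return False

# Replica 0 is cold, replica 1 is hot. The hot replica found a lower loss,
# so the exchange is always accepted and the good state moves to the cold slot.
states = ["params_cold", "params_hot"]
losses = [2.0, 1.0]
temps = [1.0, 2.0]
accepted = maybe_swap(states, losses, temps, i=0, rng=random.Random(0))
```

Swaps that move a lower-loss state to a colder replica are always accepted; the reverse direction is accepted only occasionally, which is what lets hot replicas explore without discarding good solutions.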
Parallel Tempered Mask Alignment
fit_transformer_readout_parallel_tempering is the sparse structural training
method. It uses parallel tempering directly during training.
The method searches over sparse binary masks on the thermodynamic reservoir features:
- Each replica holds a candidate feature mask.
- Low-temperature replicas exploit good sparse masks.
- High-temperature replicas explore by accepting worse masks more often.
- Adjacent replicas swap masks using a Metropolis tempering rule.
- For each mask, the readout is solved by ridge regression on the selected features.
- The best mask is programmed into the output projection as a sparse readout.
from thermocompute import (
ThermodynamicTransformerLayer,
fit_transformer_readout_parallel_tempering,
)
layer = ThermodynamicTransformerLayer(
embed_dim=64,
num_heads=4,
thermo_hidden_dim=1024,
t_f=0.2,
dt=0.04,
memory_efficient_chunk_size=256,
).to(device)
fit = fit_transformer_readout_parallel_tempering(
layer,
tokens,
targets,
ridge=1e-1,
keep_fraction=0.35,
sparsity_penalty=1e-3,
n_tempering_replicas=6,
n_tempering_steps=32,
)
print(fit.train_mse)
print(fit.selected_fraction)
print(fit.swap_acceptance)
This is useful when the hardware or deployment target benefits from sparse readout wiring. The search problem is discrete and nonconvex, so parallel tempering is a natural fit: hot replicas discover alternate feature subsets, while cold replicas preserve strong candidates. Inference remains the same fixed-time thermodynamic pass.
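A single-temperature version of the mask search can be sketched as follows. To stay self-contained it replaces the ridge solve with a toy additive energy in which each feature has a made-up usefulness score, so it shows only the propose-and-accept loop. The scores, the penalty value, and all names are illustrative assumptions.

```python
import math
import random

# Toy per-feature error reduction (illustrative numbers, not measurements).
USEFULNESS = [0.30, 0.02, 0.25, 0.01, 0.20, 0.01, 0.15, 0.02]

def energy(mask, sparsity_penalty=0.05):
    """Stand-in for 'ridge MSE on selected features' plus a per-feature cost."""
    mse = 1.0 - sum(u for u, keep in zip(USEFULNESS, mask) if keep)
    return mse + sparsity_penalty * sum(mask)

def metropolis_mask_search(n_steps, temperature, rng):
    """Flip one mask bit per step and accept with the Metropolis rule."""
    mask = [rng.random() < 0.5 for _ in USEFULNESS]
    current = energy(mask)
    best_mask, best_energy = mask[:], current
    for _ in range(n_steps):
        proposal = mask[:]
        flip = rng.randrange(len(mask))
        proposal[flip] = not proposal[flip]
        candidate = energy(proposal)
        # Improvements are always accepted; worse masks pass with Boltzmann probability.
        if candidate <= current or rng.random() < math.exp((current - candidate) / temperature):
            mask, current = proposal, candidate
            if current < best_energy:
                best_mask, best_energy = mask[:], current
    return best_mask, best_energy

best_mask, best_energy = metropolis_mask_search(2000, temperature=0.02, rng=random.Random(0))
```

In a tempered version, several such chains would run at different temperatures and exchange masks between neighbors. This toy energy has no local minima, so it does not show why tempering helps; a real readout objective couples features and is where hot replicas earn their cost.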
Running Experiments
Smoke checks:
python scripts/run_smoke.py
Proof-of-concept checks:
python scripts/run_poc.py
Full experiment suite:
python scripts/run_experiments.py --outdir artifacts
Research-proof benchmark suite:
python scripts/run_benchmarks.py --outdir artifacts
Flagship superiority demo:
python scripts/run_superiority_demo.py --outdir artifacts
Current experiments include:
- Fixed-physical-time width scaling for thermodynamic MLPs.
- Parallel tempering escape on a double-well landscape.
- PMOG multimodal fidelity against programmed target mixture moments.
- Fixed-physical-time width scaling for thermodynamic transformer FFN width.
- PDIT sampled-attention convergence against softmax attention.
- Thermodynamic Readout Alignment on a nonlinear token mapping.
- Parallel Tempered Mask Alignment for sparse thermodynamic readout training.
- End-to-end parallel-tempered inductive transformer training with no ridge solve.
- Cold single-replica end-to-end inductive training as the default no-ridge training path.
- Superiority demo against dense digital FFN scaling: measured widths up to 4096 and projected widths up to 65536, with modeled physical time flat and digital work increasing.
The benchmark suite writes JSON artifacts using this shape:
{
  "name": "benchmark_name",
  "metrics": {
    "device": "cuda",
    "rows": [],
    "physical_time_range": 0.0
  }
}
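A small reader for artifacts of this shape might look like the sketch below. The top-level keys come from the example above, but the per-row schema is not shown there, so the "physical_time" row key and the reading of physical_time_range as max minus min are assumptions. `summarize_artifact` is a hypothetical helper.

```python
import json

def summarize_artifact(text):
    """Parse one benchmark artifact and recompute the spread of modeled physical time."""
    artifact = json.loads(text)
    metrics = artifact["metrics"]
    # ASSUMPTION: each row is a dict that carries a "physical_time" entry.
    times = [row["physical_time"] for row in metrics["rows"]]
    spread = max(times) - min(times) if times else 0.0
    return {
        "name": artifact["name"],
        "device": metrics["device"],
        "n_rows": len(times),
        "recomputed_range": spread,
        "reported_range": metrics["physical_time_range"],
    }

# Placeholder artifact built in memory; the row values are illustrative only.
example = json.dumps({
    "name": "benchmark_name",
    "metrics": {
        "device": "cpu",
        "rows": [
            {"width": 64, "physical_time": 0.4},
            {"width": 256, "physical_time": 0.4},
        ],
        "physical_time_range": 0.0,
    },
})
summary = summarize_artifact(example)
```

Comparing the recomputed spread with the reported one is a cheap consistency check before plotting.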
The flagship benchmark artifact is fixed_time_advantage.png: thermodynamic
modeled physical time stays flat with width while the classical FFN FLOP proxy
grows.
The flagship demo artifacts are:
- superiority_demo.json: source-of-truth metrics and claim boundary.
- superiority_demo.md: compact academic-style summary table.
- superiority_latency_advantage.png: fixed thermodynamic physical time vs rising dense digital work.
- superiority_advantage_factor.png: modeled latency advantage factor as width grows.
- superiority_loss_vs_width.png: sanity task loss after a matched readout fit.
What This Proves / What It Does Not Prove
See docs/claims.md for the explicit claim boundary. See docs/engineering_assessment.md for the maintainer-style assessment of current strengths, gaps, and next hardening steps.
In short:
- Proves: fixed-depth thermodynamic neuron/transformer blocks report constant modeled physical time as width increases.
- Proves: the benchmark suite separates modeled physical time from PyTorch wall time and classical FLOP proxies.
- Suggests: on CUDA, emulator wall time can remain roughly flat across useful thermodynamic width ranges before hardware saturation.
- Does not prove: PyTorch wall time is universally constant with width.
- Does not prove: training is constant-time or faster than state-of-the-art classical training.
- Does not prove: real chip speedups; this is an emulator with no hardware backend.
Outputs are written as JSON and PNG files under artifacts/.
Optional CUDA Kernels
The default implementation is pure PyTorch and works on CPU or CUDA. The
package also includes an experimental CUDA extension source in
thermocompute/cuda/thermo_kernels.cu and a loader in thermocompute/cuda_ext.py.
The extension is optional. If compilation fails or a CUDA compiler is not available, the package falls back to PyTorch kernels.
from thermocompute.cuda_ext import has_cuda_extension
print(has_cuda_extension())
Scaling Advantage
There are now two scaling stories in the package.
The first is practical and immediate: GPU-only thermodynamic inference can
show a wall-clock plateau. The PyTorch implementation vectorizes the
thermodynamic width dimension. On a CUDA device with unused parallel capacity,
increasing thermo_hidden_dim may not produce a proportional wall-clock
increase. This makes the software useful on its own, even before a physical
thermodynamic chip exists.
The second is theoretical and hardware-facing: modeled physical inference time is constant with width under the parallel thermodynamic substrate model. That claim is stronger and cleaner, but it depends on the future hardware assumption that width maps to parallel physical units.
The important scaling distinction is therefore between GPU emulator runtime and modeled physical runtime.
In a conventional digital sampler, sampling cost usually grows with at least one of these quantities:
- number of variables
- number of MCMC sweeps or diffusion steps
- number of replicas or chains
- number of hidden units
- batch size
In the thermodynamic hardware model, many of those dimensions become parallel area and power costs rather than sequential time costs. If a million units are available on the substrate, they all evolve during the same physical window. The modeled forward-pass time is controlled by the relaxation window, not by the number of units participating in that window.
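The contrast between the two cost models can be written down in a few lines. The sketch below uses a common FLOP proxy for a dense two-layer FFN (two FLOPs per multiply-add, two matrix multiplications) and treats modeled physical time as the fixed window times the number of sequential layers. Both functions are illustrative assumptions; the package's own proxies may be defined differently.

```python
def dense_ffn_flops(embed_dim, hidden_dim, n_tokens=1):
    """FLOP proxy for Linear(embed, hidden) followed by Linear(hidden, embed)."""
    return 2 * 2 * embed_dim * hidden_dim * n_tokens

def modeled_physical_time(t_f, n_sequential_layers=1):
    """In the hardware model, fixed windows add along depth but not along width."""
    return t_f * n_sequential_layers

rows = [
    {
        "width": width,
        "flops": dense_ffn_flops(embed_dim=128, hidden_dim=width),
        "physical_time": modeled_physical_time(t_f=0.4),
    }
    for width in (256, 1024, 4096)
]
```

In this toy table the FLOP column grows linearly with width while the physical-time column does not change. Stacking layers without pipelining would still multiply the window count, which is why the width claim does not extend to depth.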
This package exposes that separation directly. In the width-scaling
experiments, the thermodynamic MLP and transformer FFN modeled physical time
stay fixed as width increases. The thermodynamic CNN extends the same idea to
local receptive fields: thermo_channels can widen the local stochastic feature
bank while the modeled convolutional physical time remains the same fixed
window. PyTorch wall time is still a normal GPU
measurement, and it is valuable in its own right: if it stays roughly flat over
your target width range, then the GPU-only path is already useful as a wide
stochastic layer family. The physical-time metric shows what the same
computation would ask from a massively parallel thermodynamic substrate.
The same principle now applies to the thermodynamic transformer layer. The
thermo_hidden_dim can be varied from small to very wide while the modeled
feed-forward physical time remains t_f. That is the central result:
variable-width neural inference in constant modeled physical time.
For the GPU-only path, the key engineering question is not “is wall time
mathematically constant?” It is “where is the plateau on this GPU for this
batch, sequence length, dtype, and width range?” The shipped benchmarks answer
that question empirically by reporting wall_ms_median beside
physical_time.
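One simple way to turn such measurements into a plateau estimate is to find the largest width whose median wall time stays within a tolerance factor of the smallest-width time. The helper and the 1.25 tolerance below are hypothetical, and the rows are synthetic placeholders rather than benchmark results.

```python
def plateau_width(rows, tolerance=1.25):
    """Largest width whose wall time stays within tolerance x the smallest-width time."""
    ordered = sorted(rows, key=lambda r: r["width"])
    baseline = ordered[0]["wall_ms_median"]
    plateau = ordered[0]["width"]
    for row in ordered[1:]:
        if row["wall_ms_median"] > tolerance * baseline:
            break  # first width that leaves the plateau ends the scan
        plateau = row["width"]
    return plateau

# Synthetic numbers for illustration only; real values depend on GPU, batch,
# sequence length, and dtype.
rows = [
    {"width": 256, "wall_ms_median": 1.0, "physical_time": 0.4},
    {"width": 512, "wall_ms_median": 1.1, "physical_time": 0.4},
    {"width": 1024, "wall_ms_median": 1.2, "physical_time": 0.4},
    {"width": 2048, "wall_ms_median": 2.1, "physical_time": 0.4},
    {"width": 4096, "wall_ms_median": 4.0, "physical_time": 0.4},
]
edge = plateau_width(rows)
```

The tolerance is a judgment call: a tighter value gives a more conservative plateau and is more sensitive to timing noise.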
The readout-alignment experiment adds a first training counterpart: a wider thermodynamic transformer reservoir improves one-shot fitted error while the inference window remains fixed. Training still includes a digital solve, but it does not require backpropagating through many stochastic inference passes.
What This Could Mean For Huge Models
For huge models, the first serious use case is GPU-only experimentation. Treat the thermodynamic layer as a practical stochastic PyTorch module that can approximate very wide nonlinear feature arrays while measuring whether the CUDA wall-clock cost remains flat enough to matter. This can be useful before any thermodynamic chip exists.
The most interesting target is not replacing every digital operation. The strongest near-term opportunity is replacing the stochastic and probabilistic core: sampling, uncertainty propagation, energy-based inference, diffusion-like refinement, latent-variable search, and ensemble-style exploration.
The highest-leverage claim is simple and aggressive: if the model’s width is mapped onto parallel thermodynamic units, then widening the neural computation does not have to lengthen the modeled physical inference time. To our knowledge, this is not how mainstream transformer scaling is usually framed. Conventional GPU inference treats width as more arithmetic. A thermodynamic substrate treats width as more parallel physical dynamics.
If those computations can be mapped to dense thermodynamic arrays, the scaling profile changes:
- On GPUs today, moderate width increases may fit inside an existing parallel execution plateau rather than producing proportional latency.
- Wider latent spaces increase hardware area, not necessarily physical latency.
- Wider transformer feed-forward blocks increase thermodynamic fabric, not necessarily the fixed-time inference window.
- Wider convolutional feature banks increase local thermodynamic fabric, not necessarily the fixed-time local inference window.
- More parallel samples increase device count, not necessarily sequential sampling time.
- Multimodal exploration can use replica temperature ladders in the same fixed physical window.
- Energy-based and diffusion-like models could trade long digital sampling loops for short analog relaxation windows.
- Training can remain digital while inference and sampling move to a lower energy physical substrate.
The practical implication is that frontier-scale probabilistic models may not need to pay digital iteration costs for every sample forever. If the analog substrate is large enough and well matched to the model, the bottleneck can move from “how many sequential sampling steps do we need?” to “how much parallel thermodynamic fabric can we afford?”
For huge transformers, the thermodynamic transformer layer points at a concrete hybrid path: keep token routing, memory, and parts of attention digital where that is practical, but move the wide nonlinear feed-forward and stochastic sampling work into thermodynamic layers. On GPUs, that means exploiting the wall-clock plateau where it exists. On hardware, that means fixed-time physical dynamics. That is how this package frames the frontier opportunity: inference of variable-width neural blocks with unusually weak latency scaling today, and constant modeled physical time on the target substrate.
Thermodynamic Readout Alignment is the first practical training answer in this direction. Instead of training every stochastic element by backpropagation, the thermodynamic block can act as a huge physical feature generator and the readout can be solved or locally adapted. For very large models, that suggests a hybrid training stack: digital pretraining and routing where needed, local or closed-form readout alignment where possible, and thermodynamic inference for the wide stochastic nonlinear core.
Parallel Tempered End-To-End Training is an optional answer when a frozen reservoir is not enough and the cold path appears stuck. It trains the whole thermodynamic transformer layer by running multiple temperature replicas of the model itself. That gives a path toward ordinary inductive training while retaining thermodynamic exploration as an optimization mechanism, but it should not be assumed to win without scale-specific evidence.
Cold End-To-End Training matters for pragmatism. If a single replica already learns well, it is the right default because it is cheaper in memory and wall time. Parallel tempering should be treated as an exploration upgrade, not a free lunch.
Parallel Tempered Mask Alignment adds the next ingredient: sparse structural search over the physical feature fabric. That matters for hardware because not every thermodynamic feature needs to be wired into every downstream channel. A parallel-tempered training phase can search the wiring pattern while preserving the constant-time inference story after the sparse readout is programmed.
That does not remove engineering constraints. Coupling precision, calibration, readout bandwidth, memory movement, temperature control, and train-to-hardware mapping all matter. But the upside is large enough to justify building software emulators now: they let us discover which algorithms actually benefit from the physics before the hardware path is fixed.
Development Notes
Run all local checks:
python scripts/run_tests.py
python scripts/run_stress.py
python scripts/run_smoke.py
python scripts/run_poc.py
python scripts/run_precision_experiments.py --device cpu --outdir artifacts
python scripts/run_precision_training_comparison.py --device cpu --steps 20 --repeats 3 --outdir artifacts
python scripts/run_flow_matching_experiment.py --device cpu --train-steps 96 --sample-count 128 --outdir artifacts
python scripts/run_cnn_experiment.py --device cpu --train-steps 80 --outdir artifacts
python scripts/run_gaussian_process_experiment.py --device cpu --train-samples 32 --test-samples 96 --random-features 128 --outdir artifacts
python scripts/run_readout_training_comparison.py --device cpu --flow-steps 96 --cnn-steps 80 --outdir artifacts
python scripts/run_experiments.py --outdir artifacts
python scripts/run_benchmarks.py --outdir artifacts
Keep claims grounded when interpreting results:
- physical_time is an emulator metric, not a measured chip benchmark.
- PyTorch/CUDA wall time is a practical software result. A flat wall-clock plateau is useful for GPU-only deployment, but it is not the same as a hardware-theory guarantee.
- The GPU wall-clock plateau is shape- and device-dependent. Expect saturation once width, batch, sequence length, dtype, memory bandwidth, or kernel occupancy becomes limiting.
- Constant physical time applies to parallel units inside a fixed sequential layer schedule. Deeper unpipelined networks still add sequential fixed-time windows.
- The variable-width constant-time claim applies to modeled physical time under a parallel thermodynamic substrate. It does not mean CPU/GPU software runtime is independent of tensor width for all shapes.
- Thermodynamic Readout Alignment is a fast readout-training method, not a complete replacement for all gradient-based training.
- Cold End-To-End Training is the default no-ridge inductive path. It is the first method to try when memory or training time matters.
- Parallel Tempered End-To-End Training is an optional exploration upgrade. It is more expensive because it trains full model replicas, and our current small experiment did not show enough value to make it the default.
- Parallel Tempered Mask Alignment uses tempering to search sparse readout structure. It is currently a training-time algorithm; inference still uses a single fixed thermodynamic pass.