@lqiao: https://x.com/lqiao/status/2070026145895256314


Summary

Fireworks is offering a managed service for reinforcement learning training on GLM 5.2 that keeps training and inference numerically identical via batch invariance and zero-KLD alignment, a capability previously available only to top frontier labs. This lets anyone customize the model and push beyond frontier quality.

https://t.co/KlrnWFHMNn

Cached at: 06/25/26, 11:17 AM

Bitwise Identical Separates Frontier Training

The biggest differentiator in frontier training is not so much the model or a huge GPU fleet. It’s the last few bits: getting the numerics exactly aligned.

We shipped batch invariance and zero-KLD for on-policy RL training of large MoE models, including GLM 5.2!

A tiny numerical mismatch between training and inference can derail reinforcement learning on giant models. Only a handful of frontier labs had the infrastructure to eliminate it.

We are shifting that control today – GLM 5.2 is a frontier-quality open-weight model. It’s now possible for anyone to customize it and push beyond frontier quality in their own application.

The hard part of reinforcement learning on a frontier model is usually not the algorithm. It’s the infrastructure that keeps training and inference numerically identical: zero KLD, end to end. At Fireworks, we’ve long invested in training quality, and are now offering it as a managed service, starting with GLM 5.2.

For years, getting reinforcement learning to actually work on giant models was limited to expert teams at the top frontier labs, the ones with the unglamorous infrastructure underneath: training and serving stacks engineered to produce the same numbers, repeatably.

That infrastructure is built on concepts people outside those labs have never had to think about, for example batch invariance and zero-KLD across training and serving: making the rollout engine and trainer produce the same numbers. It sounds like plumbing, but it’s the difference between an RL run that hill climbs successfully, and one that quietly falls apart.

The toolkit that used to be locked inside a handful of labs is now available as a managed service on Fireworks, starting with GLM 5.2. We help you reach frontier specialized intelligence you can own instead of rent.

The features that used to be frontier-lab-only

Reinforcement learning on an LLM is a loop: the model generates responses, those responses are scored, and the trainer moves the weights. The whole thing rests on the assumption that the probability the trainer thinks the model gave each token matches the probability the serving engine actually used to generate it. When that holds, learning signal flows. When it doesn’t, you’re optimizing against noise.
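One way to make that assumption concrete is to compare, token by token, the log-probability the serving engine reported for each sampled token with the log-probability the trainer recomputes for it. The sketch below is illustrative only: the function name and the sample numbers are hypothetical, and it uses the simple sample-mean estimator of KL(rollout || trainer), which is an assumption rather than the metric definition used in the article.

```python
def sampled_kl(rollout_logprobs, trainer_logprobs):
    """Sample-mean estimate of KL(rollout || trainer) over generated tokens.

    Each entry is the log-probability of the token that was actually sampled:
    once as reported by the serving engine, once as recomputed by the trainer.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("logprob lists must align token by token")
    diffs = [r - t for r, t in zip(rollout_logprobs, trainer_logprobs)]
    return sum(diffs) / len(diffs)


# Hypothetical per-token logprobs for one short response.
rollout = [-0.25, -1.5, -0.125]
trainer_aligned = [-0.25, -1.5, -0.125]   # trainer agrees exactly
trainer_drifted = [-0.5, -1.5, -0.25]     # trainer disagrees on two tokens

kl_aligned = sampled_kl(rollout, trainer_aligned)
kl_drifted = sampled_kl(rollout, trainer_drifted)
print(kl_aligned)  # 0.0
print(kl_drifted)  # 0.125
```

When the two lists are identical the estimate is exactly zero; any nonzero value means the trainer is updating a policy slightly different from the one that generated the rollout.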

Holding that assumption on a modern frontier model is genuinely hard, and Fireworks gives you the tools the big labs built to do it:

  • Batch invariance for Large MoEs - a request returns the same result no matter what other traffic happens to share its batch. Without it, an “on-policy” run is quietly off-policy: the rollout was generated by a subtly different model than the one you’re updating, just because the server was busy. The building blocks are increasingly public: open-source engines ship batch-invariant kernels for smaller, dense models (vLLM, SGLang), and DeepSeek’s DeepGEMM, the kernel library behind DeepSeek-V4, provides batch-invariant grouped-GEMM kernels for MoEs. It replaces cuBLAS end to end and drops the split-K trick precisely because split-K breaks invariance. But a pile of batch-invariant kernels is not a batch-invariant system. True end-to-end invariance means every reduction - attention, the MoE router, the expert GEMMs, and the multi-rank all-reduce stack - stays consistent together, under real production load. Delivering that whole-system guarantee for a frontier MoE like GLM 5.2 as a managed service is, to our knowledge, a first in the industry.

  • Zero-KLD train/serve alignment - for a model like GLM 5.2, the usual patches don’t reach far enough. The popular Mixture-of-Experts fix, router replay (replaying the serving engine’s expert choices inside the trainer), handles which experts fire but it can’t touch the other place these models diverge: which tokens the sparse-attention indexer selects. Those selections aren’t tractable to replay. So there’s no halfway house here. The generation engine, the prompt-reading path, and the trainer have to share one numerical definition, so what you train is exactly what you served: zero KLD, end to end.

Zero-KLD Train/Serve Loop: The same request flows through serving prefill and generation, then through the trainer path. The two streams converge when served logits and trainer logprobs match at KLD = 0.

These are exactly the pieces most platforms don’t have, and their absence is why so many RL efforts stall.

Validations

Here is the same RL task, the GLM countdown reasoning task, run two ways. Both use the same algorithm and data. The only difference is the numerics underneath.

Validation Runs: The same GLM countdown reasoning task behaves differently when the trainer and rollout engine disagree: reward collapses and clipping throws away learning signal. With zero-KLD numerics, the loop stays on-policy.

Without the Fireworks numerics stack, the trainer and the rollout engine disagree (train-inference KL around 0.013), and the run leans hard on the industry’s usual crutch: importance sampling and clipping, which discard about 45% of every batch’s tokens just to stay upright. It still isn’t enough. Around step 20 the reward collapses, falling from around 0.9 to under 0.2 as the policy chases a target that no longer matches what it generated.

With the Fireworks stack, the trainer and serving engine run at zero KLD, end to end - bit-for-bit identical - with zero tokens clipped, and reward stays healthy across the entire run. Same task, same algorithm. The only thing that changed was making the numbers agree.

That’s the trap with the importance-sampling-and-clipping approach: it’s a tax, not a fix. Every clipped token is learning signal thrown away, and past a point no amount of clipping saves a run whose numbers don’t line up.
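To see why clipping discards signal, here is a minimal sketch of how a clipped-token fraction could be measured. The article does not state the exact clipping rule or threshold, so the symmetric band [1 - eps, 1 + eps], the value eps = 0.2, the function name, and the sample logprobs are all assumptions for illustration.

```python
import math


def clipped_fraction(rollout_logprobs, trainer_logprobs, eps=0.2):
    """Fraction of tokens whose importance ratio leaves [1 - eps, 1 + eps]."""
    clipped = 0
    for r, t in zip(rollout_logprobs, trainer_logprobs):
        ratio = math.exp(t - r)  # trainer probability / rollout probability
        if ratio < 1.0 - eps or ratio > 1.0 + eps:
            clipped += 1
    return clipped / len(rollout_logprobs)


# Hypothetical per-token logprobs.
rollout = [-1.0, -2.0, -0.5, -3.0]
trainer_drifted = [-1.0, -2.5, -0.5, -2.5]

frac_aligned = clipped_fraction(rollout, rollout)
frac_drifted = clipped_fraction(rollout, trainer_drifted)
print(frac_aligned)  # 0.0
print(frac_drifted)  # 0.5
```

With identical logprobs every ratio is exactly 1 and nothing is clipped, which is the zero-tokens-clipped behavior described for the aligned run.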

Why bitwise identical is a complex problem

It comes down to a property of floating-point math that trips up almost everyone: addition isn’t associative. (a + b) + c isn’t guaranteed to equal a + (b + c) down at the bit level, so the order in which a GPU adds numbers up can change the answer, usually in the last few digits, occasionally by enough to flip a token.
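The effect is easy to reproduce on a CPU. The snippet below uses ordinary Python floats (IEEE 754 double precision) rather than the lower-precision formats a GPU kernel would use, so it illustrates the property itself, not any particular kernel.

```python
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```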

A frontier MoE changes that order constantly, for reasons that have nothing to do with your request:

  • Latent attention (MLA) - the compression trick that makes GLM cheap at long context - splits its reduction into chunks across the GPU. The chunk boundaries shift with whatever sequence lengths happen to share the batch, so the same query can accumulate its attention in a different order from one moment to the next.

  • The sparse indexer that decides which past tokens each query even looks at can hand back the same set in a different order. Because the attention sum follows that order, the result drifts.

  • Each expert’s matmul runs a different kernel and tiling depending on how many tokens that expert drew this step, which depends on everyone else’s tokens, not just yours.

  • The router can land on a near-tie between two experts; a rounding-error-sized wobble flips which one fires, and the token’s entire computation changes with it.

  • Across GPUs, the all-reduce that stitches partial sums back together switches algorithms by message size, which again rides on load.
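The common thread in these bullets is a reduction whose grouping depends on load. A toy version in plain Python: the same four values are summed with two different chunk sizes, standing in for a scheduler that picks its split based on co-batched traffic. The chunking heuristic, function names, and values are hypothetical and chosen to make the rounding visible; real kernels differ in detail.

```python
def chunked_sum(values, chunk_size):
    """Sum each chunk left to right, then add the partial sums in order."""
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total


def load_dependent_sum(values, co_batched_requests):
    # Hypothetical heuristic: a busy server splits the reduction more finely.
    chunk_size = 4 if co_batched_requests == 0 else 2
    return chunked_sum(values, chunk_size)


def batch_invariant_sum(values, co_batched_requests):
    # The chunk size is pinned, so other traffic cannot change the result.
    return chunked_sum(values, 2)


values = [1e16, 1.0, -1e16, 1.0]

quiet = load_dependent_sum(values, co_batched_requests=0)
busy = load_dependent_sum(values, co_batched_requests=7)
print(quiet, busy)  # 1.0 0.0

pinned_quiet = batch_invariant_sum(values, co_batched_requests=0)
pinned_busy = batch_invariant_sum(values, co_batched_requests=7)
print(pinned_quiet == pinned_busy)  # True
```

Neither answer is "wrong" in the floating-point sense; the point is that the load-dependent version gives the same request two different answers, while the pinned version always gives one.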

Stack those together and “temperature 0” on a busy server is quietly nondeterministic: the same prompt, co-batched with different traffic, comes back subtly and sometimes meaningfully different. That’s the gap that silently turns an on-policy RL run off-policy.

Getting to numerics you can trust meant pinning every one of those decisions so it depends only on your request: a fixed reduction order in attention no matter the batch, one settled kernel choice for the expert matmuls regardless of token counts, a deterministic tie-break in the router, and a single fixed cross-GPU reduction path, all without giving up so much speed that the trainer becomes unusable for everyday SFT/DPO.
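As a small illustration of the router piece, the sketch below picks the top-k experts with a fixed rule for exact ties (lower expert index wins). This is an assumed tie-break chosen for the example, not a description of the Fireworks router, and the scores are hypothetical.

```python
def top_k_experts(scores, k):
    """Return indices of the k highest scores; exact ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


scores = [0.25, 0.5, 0.5, 0.125]           # experts 1 and 2 tie exactly
wobbled = [0.25, 0.5, 0.5 + 2**-40, 0.125]  # rounding-sized nudge to expert 2

print(top_k_experts(scores, 1))   # [1]
print(top_k_experts(scores, 2))   # [1, 2]
print(top_k_experts(wobbled, 1))  # [2]
```

The last line shows why a tie-break alone is not sufficient: it settles exact ties, but a near-tie still flips unless the scores feeding the router are themselves bitwise stable, which is what the fixed reduction order and kernel choices provide.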

We then aligned the trainer’s own forward pass to that same serving definition. On the validated GLM-5.2 LoRA path, trainer and serving reach zero KLD - a train-inference generation KL of exactly 0, bit-for-bit identical - and the served model returns the same output at temperature 0 regardless of concurrency.

Zero KLD, end to end

Tiny train-inference numerical disagreements aren’t benign noise: they quietly turn on-policy RL off-policy and can independently cause a run to collapse. For related background on inference nondeterminism, see TML; for the RL failure mode, see this diagnosis. The clean answer isn’t to correct the gap with importance sampling, which only piles on variance; it’s to erase it. When training and inference are made bitwise consistent so the KLD is exactly 0, RL trains in fewer steps and reaches higher reward (vLLM x TorchTitan).

That’s what Fireworks delivers on GLM 5.2: batch-invariant serving with zero-KLD train-inference alignment. The rollout engine returns bit-for-bit identical logits no matter the batch size, the concurrent load, or how many GPUs it’s sharded across, and the trainer is held to that same bit-exact standard, so the full train-rollout loop runs at zero KLD. This is the guarantee frontier labs build in-house, and it’s genuinely hard to hold across different engines, kernels, and parallelism layouts. Determinism across tensor-parallel sizes alone was an open research problem as recently as this year (TBIK, ICML 2026).
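"Bit-for-bit identical" is a stricter test than the tolerance checks most test suites use. A minimal sketch of the difference, using Python's standard struct module to compare raw encodings; the helper names and values are hypothetical, and 64-bit floats stand in for whatever precision the real logits use.

```python
import math
import struct


def float_bits(x):
    """Raw little-endian IEEE 754 double encoding of x."""
    return struct.pack("<d", x)


def bitwise_equal(a, b):
    """True only if both sequences have identical bit patterns, element by element."""
    return len(a) == len(b) and all(
        float_bits(x) == float_bits(y) for x, y in zip(a, b)
    )


served = [0.1 + 0.2, -1.5]   # hypothetical logits from the serving path
trainer = [0.3, -1.5]        # hypothetical logits recomputed by the trainer

close = all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(served, trainer))
exact = bitwise_equal(served, trainer)
print(close)  # True
print(exact)  # False
```

A tolerance check passes here while the bitwise check fails, which is the kind of gap that stays invisible in ordinary tests but still shows up as nonzero train-inference KL.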

The catch everyone else hits is speed: open-source deterministic modes typically run 35-60% slower (SGLang). Fireworks pays virtually none of that tax. The GLM trainer holds around 3,500 tokens/sec per node, on par with the OSS TileLang implementation, and layers the zero-KLD numerics on top rather than trading speed for them. One stack gives the SFT/DPO majority full throughput and gives RL teams zero-KLD numerics: frontier-lab infrastructure, delivered as a managed service.

What you get on Fireworks today

  • A frontier model, RL-ready. GLM 5.2 is live for fine-tuning through the Fireworks Training API, with the full numerical foundation carried forward from GLM 5.1. The validated training shape is public today.

  • The methods that matter. SFT, DPO, and RL through the Training API; SFT and DPO on managed training. On-policy RL where the trainer and serving engine genuinely agree so your signal is real learning, not drift you clipped away.

  • Reproducible, auditable inference. Temperature-0 requests return the same answer regardless of server load so you get trustworthy evals, meaningful regression tests, and the reproducibility enterprise compliance demands.

  • Fast where it counts. A trainer tuned for SFT/DPO throughput, rollouts that generate about 1.8x faster on GLM 5.2 than GLM 5.1 (around 5,000 tokens/sec per node, per promotion-CI), and zero-KLD train-serve numerics, so RL is genuinely on-policy.

  • Managed, self-service, and co-located. Run the full loop on managed infrastructure with trainer and deployment co-located for fast weight sync, or drive longer runs yourself through the API.

Why it matters: frontier specialized intelligence as a moat

The frontier is open now. Anyone can download a state-of-the-art model. What’s still scarce is the ability to do reinforcement learning on one of these models correctly: a loop whose numbers line up well enough to converge, results you can reproduce, and a trainer fast enough to iterate on. That used to live only inside the biggest labs.

The work behind GLM 5.1 and GLM 5.2 on Fireworks (bitwise zero-KLD numerics for RL and raw speed for SFT/DPO, in one stack) is exactly the infrastructure that used to require a frontier lab’s in-house systems team.

To get started, please reach out to our training team today, or dive straight in with our managed training docs.
