Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models

Reddit r/singularity 05/16/26, 09:11 AM Tools

llm pre-training efficiency nous-research token-superposition mixture-of-experts

Summary

Nous Research releases Token Superposition Training (TST), a method that speeds up LLM pre-training by up to 2.5x across models from 270M to 10B parameters, reducing wall-clock time without altering architecture or data.

https://arxiv.org/abs/2605.06546
https://nousresearch.com/token-superposition

Pre-training large language models is expensive enough that even modest efficiency improvements can translate into meaningful cost and time savings. Nous Research is releasing Token Superposition Training (TST), a method that substantially reduces pre-training wall-clock time at fixed compute without touching the model architecture, optimizer, tokenizer, parallelism strategy, or training data. At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200-GPU-hours versus the baseline’s 12,311, roughly a 2.5x reduction in total pre-training time.
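As a quick check of the "roughly 2.5x" figure, the short Python sketch below recomputes the ratio from the two GPU-hour numbers quoted above. The helper names `speedup` and `fraction_saved` are illustrative only and do not come from the paper or from Nous Research.

```python
# GPU-hour figures quoted in the summary (10B-A1B mixture-of-experts run).
TST_GPU_HOURS = 4_768
BASELINE_GPU_HOURS = 12_311


def speedup(baseline_hours: float, method_hours: float) -> float:
    """Return how many times fewer GPU-hours the method uses than the baseline."""
    if method_hours <= 0:
        raise ValueError("method_hours must be positive")
    return baseline_hours / method_hours


def fraction_saved(baseline_hours: float, method_hours: float) -> float:
    """Return the fraction of the baseline's GPU-hours that the method avoids."""
    if baseline_hours <= 0:
        raise ValueError("baseline_hours must be positive")
    return 1.0 - method_hours / baseline_hours


ratio = speedup(BASELINE_GPU_HOURS, TST_GPU_HOURS)
saved = fraction_saved(BASELINE_GPU_HOURS, TST_GPU_HOURS)
hours_saved = BASELINE_GPU_HOURS - TST_GPU_HOURS

print(f"speedup: {ratio:.2f}x")
print(f"GPU-hours saved: {hours_saved:,} ({saved:.1%})")
```

On the quoted figures this gives a ratio of about 2.58x, or about 61% fewer GPU-hours, which is consistent with the rounded "2.5x" in the headline.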

Similar Articles

Efficient Pre-Training with Token Superposition

Hugging Face Daily Papers

Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.

I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size

Reddit r/LocalLLaMA

Trained a 75M parameter LLM called KeyLM from scratch on 18B tokens, achieving competitive instruction-following scores against larger models while using fewer parameters and less data.

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

arXiv cs.CL

This paper discovers predictable scaling laws for optimal hyperparameters (learning rate, batch size) in LLM continued pre-training, proposing a two-stage framework that reduces hyperparameter search overhead by up to 90% while maintaining performance.

Apertus LLM Family Expansion via Distillation and Quantization

arXiv cs.LG

This paper validates distillation and quantization as cost-effective methods to expand the Apertus LLM family to new sizes and hardware formats, producing Apertus-v1.1 models with up to 4B parameters trained on 1.7T tokens.

@omarsar0: Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you …

X AI KOLs Following

Nous Research introduces Lighthouse Attention, a training-only subquadratic wrapper for scaled dot-product attention that accelerates long-context pretraining and can be removed before deployment to preserve vanilla inference efficiency.