Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models
Summary
Nous Research releases Token Superposition Training (TST), a method that speeds up LLM pre-training by up to 2.5x across models from 270M to 10B parameters, reducing wall-clock time without altering architecture or data.
Similar Articles
Efficient Pre-Training with Token Superposition
Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.
I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size
Trained a 75M parameter LLM called KeyLM from scratch on 18B tokens, achieving competitive instruction-following scores against larger models while using fewer parameters and less data.
Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training
This paper discovers predictable scaling laws for optimal hyperparameters (learning rate, batch size) in LLM continued pre-training, proposing a two-stage framework that reduces hyperparameter search overhead by up to 90% while maintaining performance.
Apertus LLM Family Expansion via Distillation and Quantization
This paper validates distillation and quantization as cost-effective methods to expand the Apertus LLM family to new sizes and hardware formats, producing Apertus-v1.1 models with up to 4B parameters trained on 1.7T tokens.
@omarsar0: Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you …
Nous Research introduces Lighthouse Attention, a training-only subquadratic wrapper for scaled dot-product attention that accelerates long-context pretraining and can be removed before deployment to preserve vanilla inference efficiency.