Token AI releases a research paper introducing STAM, a new adaptive momentum optimizer designed to improve training stability and reduce memory usage compared to standard optimizers like AdamW.
https://preview.redd.it/3ccm5gd1puzg1.png?width=1179&format=png&auto=webp&s=c940d2e6ef1d61288ac214eae4679a7c910b7917

Today, I’m talking about a new research paper from Token AI: "Stable Training with Adaptive Momentum". It introduces what could be one of the strongest optimizers, both in theory and in results.

For years, we’ve relied on well-known optimizers like Adam, AdamW, LAMB, and others. No doubt, they’ve been the go-to choices when training AI models.

If you’re not familiar with what an optimizer is, in simple terms: it’s a core part of training any AI model. It’s the algorithm responsible for updating the model’s weights during training to reduce the loss.

That said, these optimizers come with limitations that affect training. For example, Adam uses a fixed beta1 throughout training, which can carry outdated momentum and keep pushing the model in the wrong direction. STAM addresses this by measuring the difference between the current gradient and the previous momentum (g - m). When the difference is large, it reduces beta1, leading to more stable training during noisy phases.

Another issue appears when there’s a shift or noise in training: old momentum can become harmful. STAM handles this with an adaptive beta1 based on residual variance.

A major issue in SGD with momentum is that if the direction becomes wrong, it keeps going because the momentum coefficient is fixed. STAM solves this by allowing the first-moment estimate (the momentum term) to self-correct.

Now let’s talk about STAMLite, the lighter version. It’s designed to replace AdamW as a default choice in many cases.
The key difference is that beta1 is dynamic instead of fixed:

* If gradients are noisy, it reduces momentum
* If gradients are stable, it keeps momentum high

It also improves efficiency in terms of optimizer state memory:

* AdamW requires about 2× the parameter size
* STAM Full is close to AdamW
* STAMLite requires about 1× the parameter size

In practice, STAMLite needs around 50% less optimizer state memory than AdamW and STAM Full, meaning less GPU memory used during training.

Looking at benchmarks, the results speak for themselves.

In the Hyperparameter Sweep benchmark, STAMLite achieved:

* Accuracy: 0.61
* Loss: 0.91

In the Long-Horizon Non-Stationary MLP benchmark, STAM ranked first alongside NAdam with nearly identical results:

* Accuracy: 0.97
* Loss: 0.09

More benchmarks are available on the website and in the research paper.

This is an important step from Token AI, breaking the long-standing reliance on a limited set of optimizers that come with known issues. Even as an early release, it looks strong and promising.

Personally, I’ve already shifted to STAM and I’m currently training my first full LLM from scratch using it. I’ll be sharing the results soon.

Research paper: [https://tokenai.cloud/research/stam](https://tokenai.cloud/research/stam)

Let me know what you think.
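For anyone who wants to see the adaptive beta1 idea in code, here is a rough Python sketch of how a momentum term could lower beta1 when the residual (g - m) gets large. To be clear, this is not the update rule from the paper: the squashing function, the `beta1_min`/`beta1_max` bounds, the `rho` smoothing factor, and all function names are my own assumptions, picked only to illustrate the behavior described above.

```python
def adaptive_beta1(res_var, grad_var, beta1_max=0.9, beta1_min=0.5, eps=1e-12):
    """Hypothetical mapping from residual variance to beta1 (NOT the paper's formula).

    res_var  : running average of (g - m)^2
    grad_var : running average of g^2, used only to make the ratio scale-free
    """
    ratio = res_var / (grad_var + eps)   # relative disagreement between g and m
    squash = ratio / (1.0 + ratio)       # maps [0, inf) into [0, 1)
    # Low disagreement -> beta1 near beta1_max; high disagreement -> toward beta1_min.
    return beta1_max - (beta1_max - beta1_min) * squash


def momentum_step(m, g, res_var, grad_var, rho=0.99):
    """One scalar momentum update with a dynamic beta1 (illustrative only)."""
    residual = g - m                                  # the (g - m) term from the post
    res_var = rho * res_var + (1.0 - rho) * residual * residual
    grad_var = rho * grad_var + (1.0 - rho) * g * g
    beta1 = adaptive_beta1(res_var, grad_var)
    m = beta1 * m + (1.0 - beta1) * g                 # standard EMA momentum form
    return m, res_var, grad_var, beta1


def final_beta1(grads):
    """Run the sketch over a sequence of scalar gradients; return the last beta1 used."""
    m, res_var, grad_var, beta1 = 0.0, 0.0, 0.0, 0.9
    for g in grads:
        m, res_var, grad_var, beta1 = momentum_step(m, g, res_var, grad_var)
    return beta1


if __name__ == "__main__":
    stable = [1.0] * 500                                          # consistent direction
    noisy = [1.0 if i % 2 == 0 else -1.0 for i in range(500)]     # sign flips every step
    print("beta1 on stable gradients:", round(final_beta1(stable), 3))
    print("beta1 on noisy gradients: ", round(final_beta1(noisy), 3))
```

With a steady gradient the residual shrinks and beta1 drifts back toward its upper bound, while gradients that flip sign every step keep the residual large and hold beta1 noticeably lower. A real implementation would work per tensor rather than per scalar and would follow the formulas in the paper.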
The author announces the release of their first AI research paper, STAM (Stable Training with Adaptive Momentum), a new deep learning optimizer addressing stability and resource efficiency, and invites feedback from the AI community.
The article distills 28 research papers into a 10-layer stack for building self-improving harnesses around AI models, emphasizing bounded, gated changes over general agent loops.
STARE addresses policy entropy collapse in GRPO-based reinforcement learning for large language models by introducing surprisal-guided token-level advantage reweighting and target-entropy regulation, achieving 4%-8% accuracy gains on AIME benchmarks.
A roundup of three notable AI papers: SkillOpt treats skill documents as trainable parameters to optimize frozen agents; a new method compiles agentic workflows into model weights for 100x cost reduction; and AutoScientists introduces a decentralized agent team for long-running science without a central planner.
This paper introduces TRAM, a method that jointly optimizes approximate multiplier structures and AI model parameters to reduce power consumption in AI accelerators while maintaining accuracy.