Tag
This article details the technical architecture and training pipeline of IBM's Granite 4.1 LLMs, covering pre-training, SFT, and RL stages. It highlights that the 8B dense model outperforms larger MoE counterparts and notes the release under Apache 2.0 license.
Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.