Tag
This paper diagnoses Training-Inference Mismatch (TIM) in LLM reinforcement learning, showing that small numerical disagreements between training and inference token probabilities can cause training collapse, and proposes remedies.
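As a rough illustration of what a "small numerical disagreement" can look like, the sketch below compares per-token log-probabilities that two engines assign to the same sampled tokens. It is not the paper's method: the function name `mismatch_stats`, its inputs, and the reported statistics are hypothetical choices made for this example, and the numbers are made up.

```python
import math


def mismatch_stats(train_logprobs, infer_logprobs):
    """Summarize disagreement between two sets of per-token log-probs.

    Hypothetical helper for illustration only. Both inputs are assumed to be
    log-probabilities of the SAME sampled tokens, one list from the training
    engine and one from the inference engine.
    """
    if not train_logprobs or len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")

    # Per-token log-ratio: log p_train(token) - log p_infer(token).
    diffs = [t - i for t, i in zip(train_logprobs, infer_logprobs)]

    return {
        "max_abs_logprob_diff": max(abs(d) for d in diffs),
        "min_token_ratio": math.exp(min(diffs)),
        "max_token_ratio": math.exp(max(diffs)),
        # Product of per-token ratios, computed in log space for stability.
        "sequence_ratio": math.exp(sum(diffs)),
    }


# Made-up example: every token differs by only 0.01 in log-probability.
stats = mismatch_stats([-0.50] * 100, [-0.51] * 100)
print(stats)
```

In this made-up case each per-token ratio is about 1.01, yet the product over 100 tokens is roughly e (about 2.7), which shows how per-token differences that look negligible can add up at the sequence level.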