training-inference-mismatch

#training-inference-mismatch

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Hugging Face Daily Papers · 2026-06-28

We introduce MIPI (Monotonic Inference Policy Improvement) and its instantiation MIPU, a two-step RL framework for LLMs that addresses the training-inference mismatch by explicitly aligning optimization with inference-policy improvement. Under FP8-quantized rollout, MIPU achieves improved reasoning performance and training stability across Qwen3-1.7B and Qwen3-4B models.

Self-Generated Error Training for Token Editing in Diffusion Language Models

arXiv cs.CL · 2026-06-17

Proposes Self-Generated T2T, a training method that aligns token editing training with inference by using the model's own predictions as error sources, improving accuracy on LLaDA2.1.

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

arXiv cs.LG · 2026-05-15

This paper diagnoses Training-Inference Mismatch (TIM) in LLM reinforcement learning, showing that small numerical disagreements between training and inference token probabilities can cause training collapse, and proposes remedies.
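
To illustrate why small per-token disagreements can matter, here is a minimal sketch that compares the log-probabilities two engines assign to the same sampled tokens. It is not the paper's diagnostic: the function name `mismatch_stats` and the example numbers are hypothetical, and the sequence-level importance ratio is only one common way to summarize the gap.

```python
import math


def mismatch_stats(train_logprobs, infer_logprobs):
    """Summarize disagreement between two engines' log-probs for the same tokens.

    Hypothetical helper for illustration; inputs are per-token log-probabilities
    of one sampled sequence under the training and inference engines.
    """
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        raise ValueError("log-prob sequences must be non-empty")

    # Per-token log-ratio: log pi_train(token) - log pi_infer(token).
    diffs = [t - i for t, i in zip(train_logprobs, infer_logprobs)]

    return {
        "max_abs_gap": max(abs(d) for d in diffs),
        "mean_abs_gap": sum(abs(d) for d in diffs) / len(diffs),
        # The sequence-level ratio is the product of per-token ratios,
        # computed here as exp(sum of log-ratios).
        "seq_ratio": math.exp(sum(diffs)),
    }


# Made-up numbers: a uniform 0.01 gap per token over a 200-token sequence.
stats = mismatch_stats([-1.0] * 200, [-1.01] * 200)
print(stats)
```

In this made-up case each token differs by only 0.01 in log-probability, yet the sequence-level ratio is exp(2), roughly 7.4, because the per-token gaps add up over the sequence length.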
