Bebop proposes entropy-aware multi-token prediction (MTP) with rejection sampling and a novel TV loss to accelerate reinforcement learning (RL) training of large language models (LLMs), achieving up to a 1.8x speedup. The method targets the degradation of acceptance rates during RL, which it counters by optimizing the training objectives.
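
To make the link between rejection sampling and acceptance rate concrete, here is a minimal sketch of the standard speculative-sampling accept/reject rule for a single token. It rests on assumptions the summary does not confirm: that "TV" stands for total variation distance, and that the rejection step follows the usual rule of accepting a drafted token x with probability min(1, p(x)/q(x)). The function names and the toy distributions are hypothetical and are not taken from Bebop. Under these assumptions the expected acceptance rate equals 1 - TV(p, q), which is one plausible reason a TV-based objective would help keep acceptance rates from degrading.

```python
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_acceptance(p, q):
    """Probability that a token drafted from q is accepted under target p.

    With acceptance probability min(1, p[x] / q[x]), the expectation over
    x ~ q is sum_x min(p[x], q[x]), which equals 1 - TV(p, q).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """Sample one token distributed as p, using a draft proposal from q.

    Returns (token, accepted). Assumes p and q are normalized over the
    same vocabulary (hypothetical toy setting, not Bebop's code).
    """
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]  # draft token; q[x] > 0 here
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]  # toy target-policy distribution
    draft = [0.2, 0.3, 0.5]   # toy draft (MTP head) distribution
    rng = random.Random(0)
    trials = 20000
    hits = sum(speculative_step(target, draft, rng)[1] for _ in range(trials))
    print("TV distance:        ", total_variation(target, draft))
    print("expected acceptance:", expected_acceptance(target, draft))
    print("empirical acceptance:", hits / trials)
```

In this toy setting, moving the draft distribution closer to the target in total variation raises the acceptance rate directly, while the residual resampling step keeps the emitted tokens distributed according to the target.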