Tag
This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs. CERO allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency; it comes with theoretical regret bounds and outperforms GRPO on mathematical reasoning tasks.
RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.
This paper introduces EDGE-OPD, a modification of on-policy self-distillation for LLMs that uses guided rollouts and evidence masks to internalize privileged context without degrading general capabilities, showing success in rare-token identity settings.
Anyscale introduces a new Agent Skill for LLM post-training that automatically selects the optimal fine-tuning method (SFT, DPO, GRPO, etc.) and generates ready-to-launch configs, helping avoid wasted GPU runs.
An open-source, hands-on course on modern reinforcement learning, covering topics from classic control to LLM post-training, RLHF, DPO, GRPO, and agentic RL, is now available as a free English PDF download.
This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by demonstrating that static mechanistic localization is insufficient because neural circuits evolve dynamically during fine-tuning. It introduces new metrics for analyzing circuit stability and argues that mechanistic localization needs predictive frameworks.
TRN-R1-Zero introduces a post-training framework that enables LLMs to perform zero-shot reasoning on text-rich networks using only reinforcement learning, without supervised fine-tuning or chain-of-thought data.