llm-post-training

#llm-post-training

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

arXiv cs.LG · 2026-06-05

This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs. CERO allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency. The method comes with theoretical regret bounds and outperforms GRPO on mathematical reasoning tasks.

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Hugging Face Daily Papers · 2026-05-27

RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.

EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

arXiv cs.AI · 2026-05-25

This paper introduces EDGE-OPD, a modification of on-policy self-distillation for LLMs that uses guided rollouts and evidence masks to internalize privileged context without degrading general capabilities, showing success in rare-token identity settings.

@anyscalecompute: LLM post-training is the new baseline. Picking the wrong method or GPU config is how you waste a 36-hour run. Introduci…

X AI KOLs Following · 2026-05-15

Anyscale introduces a new Agent Skill for LLM post-training that automatically selects the optimal fine-tuning method (SFT, DPO, GRPO, etc.) and generates ready-to-launch configs, helping avoid wasted GPU runs.

@sheriyuo: The Hands-on Modern RL tutorial everyone has been waiting for is finally available in English PDF download link: https:…

X AI KOLs Timeline · 2026-05-15

An open-source, hands-on course on modern reinforcement learning, covering topics from classic control to LLM post-training, RLHF, DPO, GRPO, and agentic RL, is now available as a free English PDF download.

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

arXiv cs.CL · 2026-05-08

This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by demonstrating that static mechanistic localization is insufficient because neural circuits evolve dynamically during fine-tuning. It introduces new metrics to analyze circuit stability and argues that mechanistic localization needs predictive frameworks.

TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

arXiv cs.CL · 2026-04-22

TRN-R1-Zero introduces a post-training framework that enables LLMs to perform zero-shot reasoning on text-rich networks using only reinforcement learning, without supervised fine-tuning or chain-of-thought data.
