@neural_avb: Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I'll be RL training on fr…

X AI KOLs Timeline 06/11/26, 08:34 AM Models

reward-model slm rl-training qa open-source grpo

Summary

@neural_avb releases a lightweight Answer-eq Reward Model for RL training on QA tasks, claiming it is super fast, far better than F1/ROUGE/BERTScore, and in 80% agreement with an external judge LM (DeepSeek).

Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I'll be RL training on free-form text and QA. This is - super fast - way better than F1/ROUGE/BERTScore - 80% agreement with external judge LM (DeepSeek). RL with Unverifiable Rewards! https://t.co/xNzUWSxgrj
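To make "GRPO-like rollouts with an RM as the rubric" concrete, here is a minimal sketch of how a group of sampled answers for one question could be scored and turned into group-relative advantages. It is not the author's pipeline: `toy_reward` is a hypothetical stand-in for the tiny RM (a crude substring check), `score_rollouts` is an illustrative helper name, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's rollout group (GRPO-style).

    Each rollout is judged relative to its siblings: above-average answers
    get a positive advantage, below-average answers a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(question: str, reference: str, answer: str) -> float:
    """Hypothetical stand-in for the reward model, NOT the real RM.

    Returns 1.0 if the reference string appears in the answer, else 0.0.
    """
    return 1.0 if reference.lower() in answer.lower() else 0.0


def score_rollouts(
    question: str,
    reference: str,
    rollouts: List[str],
    reward_fn: Callable[[str, str, str], float],
) -> List[float]:
    """Score every rollout with the reward function, then normalize per group."""
    rewards = [reward_fn(question, reference, r) for r in rollouts]
    return group_relative_advantages(rewards)


if __name__ == "__main__":
    rollouts = [
        "It was William Shakespeare.",
        "Christopher Marlowe",
        "Shakespeare wrote it",
        "I don't know",
    ]
    advantages = score_rollouts("Who wrote Hamlet?", "Shakespeare", rollouts, toy_reward)
    for text, adv in zip(rollouts, advantages):
        print(f"{adv:+.2f}  {text}")
```

If every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the reward function needs to discriminate between good and bad free-form answers.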

Original Article

Cached at: 06/12/26, 04:54 AM

Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I’ll be RL training on free form text and QA.

This is:

- super fast
- way better than F1/ROUGE/BERTScore
- 80% agreement with external judge LM (DeepSeek)

RL with Unverifiable Rewards!
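As a rough illustration of the two claims above, the sketch below shows (a) why token-overlap F1 can be a poor rubric for free-form answers and (b) how an agreement rate against a judge LM could be computed. It assumes binary equivalent/not-equivalent labels; the function names and the label lists are made up for illustration and are not the author's data or results.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 on lowercased, whitespace-split tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def agreement_rate(rm_labels: Sequence[int], judge_labels: Sequence[int]) -> float:
    """Fraction of examples where the RM and the judge give the same label."""
    if len(rm_labels) != len(judge_labels) or not rm_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(rm_labels, judge_labels))
    return matches / len(rm_labels)


if __name__ == "__main__":
    # A correct paraphrase scores zero because no tokens overlap.
    print(token_f1("new york city", "nyc"))  # 0.0
    # A wrong answer scores about 0.67 because it shares the token "paris".
    print(token_f1("not paris", "paris"))

    # Made-up labels (1 = equivalent, 0 = not equivalent), for illustration only.
    rm = [1, 1, 0, 1, 0]
    judge = [1, 0, 0, 1, 0]
    print(agreement_rate(rm, judge))  # 0.8
```

Plain agreement does not correct for chance; on an imbalanced label set a chance-corrected statistic such as Cohen's kappa may be more informative.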

Yeah! For this one I got GPT to write some rich… Normally I wouldn't bother with all this pretty printing/streaming, but since this will all go in a YT video in the end… making things look aesthetically pleasing is one of the side missions.

Let me think about it! I need to first crystallize this philosophy myself

Similar Articles

@neural_avb: Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit! Video soon if it all g…

X AI KOLs Timeline

The author is working on reasoning training with a verifiers environment using Unsloth and TRL, reports progress on locally generating GRPO-like rollouts with an SLM and a tiny RM, and promises a video soon.

@neural_avb: This post-training article came out earlier this year and completely flew under my radar. Highly recommended for my GRP…

X AI KOLs Timeline

A recommendation of a post-training article on GRPO/RLVR that was overlooked earlier this year, aimed at those interested in reinforcement learning from verifiable rewards.

@neural_avb: https://x.com/neural_avb/status/2063907440509571354

X AI KOLs Timeline

Explores a common failure mode in recursive language models (RLMs) where free-text subagent responses cause issues, and presents a solution using structured outputs to improve reliability, illustrated with a long-context question-answering example from NarrativeQA.

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

arXiv cs.LG

This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs, which allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency, achieving theoretical regret bounds and outperforming GRPO on mathematical reasoning tasks.

@RyanBoldi: Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse al…

X AI KOLs Following

Introduces Vector Policy Optimization (VPO), a new RL method that handles vector-valued rewards to improve test-time scaling for LLMs, outperforming conventional scalar reward approaches.
