@neural_avb: Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit! Video soon if it all g…
Summary
The user is working on implementing reasoning training with verifiers using Unsloth and TRL, reporting progress on locally generating GRPO-like rollouts with a small SLM and a tiny RM, and promises a video soon.
View Cached Full Text
Cached at: 06/11/26, 09:45 PM
Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit!
Video soon if it all goes well 🤞🏼 https://t.co/vlbBpXDxXa
AVB (@neural_avb): Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I’ll be RL training on free form text and QA.
This is
- super fast
- way better than F1/ROGUE/BertScore
- 80% agreement with external judge LM (deepseek)
RL with Unverifiable Rewards!
Similar Articles
@neural_avb: Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I'll be RL training on fr…
Neural_avb releases a lightweight Answer-eq Reward Model for RL training on QA tasks, claiming 80% agreement with external judge LM and faster than F1/ROUGE/BertScore.
@jiqizhixin: Awesome blog! State of RL for reasoning LLMs https://aweers.de/blog/2026/rl-for-llms/…
A comprehensive blog post reviewing the state of reinforcement learning for reasoning LLMs, covering methods from REINFORCE and PPO to GRPO and beyond, with connections to key models like InstructGPT and DeepSeek-R1.
@neural_avb: If you think about it, LLM training in 2026 is really a 3-step loop : - train it on some data - dogfood it/run categori…
The tweet outlines a 3-step loop for LLM training in 2026: train on data, run evals, and add synthetic data for underperforming tasks. It emphasizes the accessibility of legal distillation via open source models and cheap APIs, noting that training on reasoning traces alone can achieve high scores.
@neural_avb: Next video is on training tiny (<1B) models for preference tuning. Plus how to generate preference datasets with local …
Announces an upcoming video on training tiny models for preference tuning, covering reward models, RLHF, DPO, ORPO with Unsloth and TRL.
@neural_avb: https://x.com/neural_avb/status/2063907440509571354
Explores a common failure mode in recursive language models (RLMs) where free-text subagent responses cause issues, and presents a solution using structured outputs to improve reliability, illustrated with a long-context question-answering example from NarrativeQA.