@neural_avb: Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit! Video soon if it all g…

X AI KOLs Timeline 06/11/26, 03:44 PM Tools

reasoning training reinforcement-learning slm verifiers unsloth trl

Summary

The author plans to implement reasoning training by writing a verifiers environment and training with Unsloth and TRL. A quoted post reports progress on locally generating GRPO-like rollouts with an SLM, scored by a tiny RM, and a video is promised if it all goes well.

Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit! Video soon if it all goes well 🤞🏼 https://t.co/vlbBpXDxXa

Original Article

Cached at: 06/11/26, 09:45 PM

Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit!

Video soon if it all goes well 🤞🏼 https://t.co/vlbBpXDxXa

AVB (@neural_avb): Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I’ll be RL training on free form text and QA.

This is:

- super fast
- way better than F1/ROUGE/BERTScore
- 80% agreement with external judge LM (deepseek)

RL with Unverifiable Rewards!
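For readers unfamiliar with the idea, here is a rough sketch of what "GRPO-like rollouts scored by a reward model" can look like: each completion in a group gets a score, and the scores are normalized within the group to form advantages. This is not the author's code. The function toy_answer_eq_reward is a hypothetical stand-in for the tiny RM (a plain substring check), the use of the population standard deviation is an assumption, and agreement_rate is only one simple way to compare binary verdicts against an external judge.

```python
from statistics import mean, pstdev
from typing import List


def toy_answer_eq_reward(completion: str, reference: str) -> float:
    """Hypothetical stand-in for a tiny reward model.

    Returns 1.0 if the reference answer appears in the completion
    (case-insensitive), else 0.0. A real RM would be a learned model.
    """
    return 1.0 if reference.strip().lower() in completion.lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize rewards within one rollout group.

    Assumption: population standard deviation, with eps to avoid
    division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def agreement_rate(rm_verdicts: List[int], judge_verdicts: List[int]) -> float:
    """Fraction of examples where the RM and an external judge agree."""
    if len(rm_verdicts) != len(judge_verdicts):
        raise ValueError("verdict lists must have the same length")
    matches = sum(1 for a, b in zip(rm_verdicts, judge_verdicts) if a == b)
    return matches / len(rm_verdicts)


# One prompt, one group of sampled completions (made-up illustrative data).
rollouts = ["The capital is Paris.", "I think it is Lyon.", "Paris", "Not sure."]
rewards = [toy_answer_eq_reward(c, "Paris") for c in rollouts]
advantages = group_relative_advantages(rewards)

for text, adv in zip(rollouts, advantages):
    print(f"{adv:+.2f}  {text}")
```

Completions that score above the group mean get positive advantages and are reinforced; those below get negative ones. If every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal.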

Similar Articles

@neural_avb: Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I'll be RL training on fr…

X AI KOLs Timeline

Neural_avb releases a lightweight Answer-eq Reward Model for RL training on QA tasks, claiming 80% agreement with an external judge LM and that it is fast and better than F1/ROUGE/BERTScore.

@jiqizhixin: Awesome blog! State of RL for reasoning LLMs https://aweers.de/blog/2026/rl-for-llms/…

X AI KOLs Timeline

A comprehensive blog post reviewing the state of reinforcement learning for reasoning LLMs, covering methods from REINFORCE and PPO to GRPO and beyond, with connections to key models like InstructGPT and DeepSeek-R1.

@neural_avb: If you think about it, LLM training in 2026 is really a 3-step loop : - train it on some data - dogfood it/run categori…

X AI KOLs Timeline

The tweet outlines a 3-step loop for LLM training in 2026: train on data, run evals, and add synthetic data for underperforming tasks. It emphasizes the accessibility of legal distillation via open source models and cheap APIs, noting that training on reasoning traces alone can achieve high scores.

@neural_avb: Next video is on training tiny (<1B) models for preference tuning. Plus how to generate preference datasets with local …

X AI KOLs Timeline

Announces an upcoming video on training tiny models for preference tuning, covering reward models, RLHF, DPO, ORPO with Unsloth and TRL.

@neural_avb: https://x.com/neural_avb/status/2063907440509571354

X AI KOLs Timeline

Explores a common failure mode in recursive language models (RLMs) where free-text subagent responses cause issues, and presents a solution using structured outputs to improve reliability, illustrated with a long-context question-answering example from NarrativeQA.
