@neural_avb: Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit! Video soon if it all g…


Summary

The author plans to implement reasoning training with a verifiers environment using Unsloth and TRL, and promises a video if it goes well. A quoted post reports progress on generating GRPO-like rollouts locally with a small language model (SLM), scored by a tiny reward model (RM) acting as the rubric.

Lurking the Reasoning Training docs rn. Time to write a verifiers env and Unsloth/TRL that shit! Video soon if it all goes well 🤞🏼 https://t.co/vlbBpXDxXa

Cached at: 06/11/26, 09:45 PM

AVB (@neural_avb): Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I’ll be RL training on free-form text and QA.

This is

  • super fast
  • way better than F1/ROUGE/BERTScore
  • 80% agreement with external judge LM (DeepSeek)

RL with Unverifiable Rewards!
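
For readers unfamiliar with the idea, here is a minimal Python sketch of the two quantities the post refers to: group-relative advantages computed from reward-model scores (the core of a GRPO-style update) and a simple agreement rate between two judges. This is not the author's code. The function names, the example scores, the binary labels, and the use of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using population std. Real trainers may differ in the details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two judges assign the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical reward-model scores for four rollouts of a single prompt.
rm_scores = [0.9, 0.4, 0.6, 0.1]
advantages = group_relative_advantages(rm_scores)

# Hypothetical accept/reject labels from the small RM and an external judge.
rm_labels = [1, 0, 1, 1, 0]
judge_labels = [1, 0, 0, 1, 0]
rate = agreement_rate(rm_labels, judge_labels)
```

Because advantages are computed relative to the group, only the ranking and spread of the RM scores within a group matter, not their absolute scale. That is one reason a small learned scorer can stand in for an exact verifier in this kind of setup.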
