@QGallouedec: multi-turn RL and the "tito" problem keeps coming up. we've been working on it for a while, and the takeaway is that it…
Summary
A developer shares that addressing the 'tito' (token-in, token-out) problem in multi-turn reinforcement learning is simpler than commonly believed: by their account it requires only one implementation rule and one chat-template property that all models already comply with.
Cached at: 05/29/26, 11:45 AM
multi-turn RL and the “tito” problem keeps coming up. we’ve been working on it for a while, and the takeaway is that it’s much easier than people are making it.
it takes 1 implementation rule, and 1 chat-template property that all models already comply with.
that’s all you https://t.co/O7BeRiPi5Y
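The cached post is cut off before it names the rule or the template property. As a rough sketch only, the example below assumes they match what the related TITO article summary on this page describes: never re-encode decoded tokens, and use a prefix-preserving chat template. The vocabulary, template, and every function name here are hypothetical toys, not the author's code or any library's API.

```python
# Toy sketch of the token-in, token-out (TITO) idea. Everything here is
# hypothetical: a six-entry vocabulary, a made-up chat template, made-up helpers.

VOCAB = {"a": 0, "b": 1, "ab": 2, "<user>": 3, "<assistant>": 4, "<end>": 5}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}
TOKENS_LONGEST_FIRST = sorted(VOCAB, key=len, reverse=True)


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        for tok in TOKENS_LONGEST_FIRST:
            if text.startswith(tok, pos):
                ids.append(VOCAB[tok])
                pos += len(tok)
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


def decode(ids):
    """Concatenate the string form of each token id."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def render(messages):
    """Toy chat template: one <role>content<end> chunk per message."""
    return "".join(f"<{m['role']}>{m['content']}<end>" for m in messages)


def is_prefix_preserving(messages):
    """Check that rendering any leading slice is a string prefix of the full render."""
    full = render(messages)
    return all(full.startswith(render(messages[:k])) for k in range(len(messages) + 1))


def extend_context(context_ids, sampled_ids, new_text):
    """Assumed rule: keep sampled ids verbatim and tokenize only the new text."""
    return context_ids + sampled_ids + encode(new_text)


def extend_context_by_reencoding(context_ids, sampled_ids, new_text):
    """The error-prone path: decode everything to text, then encode it again."""
    return encode(decode(context_ids + sampled_ids) + new_text)


prompt_ids = encode("<user>b<end><assistant>")
sampled_ids = [VOCAB["a"], VOCAB["b"]]  # the policy sampled "a" then "b"
next_turn = "<end><user>a<end><assistant>"

tito_ids = extend_context(prompt_ids, sampled_ids, next_turn)
reencoded_ids = extend_context_by_reencoding(prompt_ids, sampled_ids, next_turn)

# Same text, different token ids: re-encoding merged "a","b" into "ab".
assert decode(tito_ids) == decode(reencoded_ids)
assert tito_ids != reencoded_ids
```

In this toy setup the two contexts decode to the same string, but only the first still contains the exact ids the policy sampled, which is what the loss would need to be computed on. The prefix check matters because appending only the newly tokenized text is valid only if adding a message never changes how earlier messages are rendered.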
Similar Articles
Agentic RL: Token-In, Token-Out Done Right (16 minute read)
This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.
Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
This paper addresses the 'Lost in Conversation' problem where LLMs struggle with information revealed across multiple turns. It proposes a scalable sharding pipeline to create multi-turn training data from single-turn QA datasets and uses reinforcement learning with verifiable rewards to train a memory-augmented policy that maintains a compact rolling memory, improving multi-turn reasoning accuracy and generalizing zero-shot to harder tasks.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 is a native GUI-centered agent model that addresses data scalability, multi-turn RL, and environment stability challenges, achieving state-of-the-art results on GUI benchmarks (88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld) and outperforming Claude and OpenAI agents.
@blc_16: MIT just released a new RL method called Pedagogical RL. The main lesson -> correct reasoning traces can still be bad t…
MIT introduces Pedagogical RL, a method that trains a teacher to produce trajectories that are learnable for a student by penalizing surprising steps, improving RL training efficiency.
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
OpenWebRL presents an open framework for training visual web agents using online multi-turn reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. Their 4B-parameter model outperforms prior open agents and competes with proprietary systems like OpenAI CUA and Gemini CUA.