@QGallouedec: multi-turn RL and the "tito" problem keeps coming up. we've been working on it for a while, and the takeaway is that it…

X AI KOLs Following 05/28/26, 09:28 AM Tools

Summary

A developer reports that solving the "tito" (token-in, token-out) problem in multi-turn reinforcement learning is simpler than commonly believed: it takes one implementation rule and one chat-template property that all models already comply with.

Original Article

Cached at: 05/29/26, 11:45 AM

multi-turn RL and the “tito” problem keeps coming up. we’ve been working on it for a while, and the takeaway is that it’s much easier than people are making it.

it takes 1 implementation rule, and 1 chat-template property that all models already comply with.

that’s all you https://t.co/O7BeRiPi5Y

Similar Articles

Agentic RL: Token-In, Token-Out Done Right (16 minute read)

TLDR AI

This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.
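To make the failure mode in this summary concrete, here is a minimal sketch of why re-encoding decoded tokens can break the invariant, and what a prefix-preserving template check might look like. Everything in it is an assumption for illustration: the six-entry vocabulary, the greedy `encode`, the `render` template, and `is_prefix_preserving` are hypothetical toys, not the article's code or any library's API.

```python
# Toy vocabulary (hypothetical). Real tokenizers are far larger but share the
# same ambiguity: one string can map to more than one token sequence.
VOCAB = {"a": 0, "b": 1, "ab": 2, "<user>": 3, "<bot>": 4, "<end>": 5}
ID_TO_PIECE = {i: p for p, i in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization over VOCAB (toy inputs only)."""
    ids, pos = [], 0
    pieces = sorted(VOCAB, key=len, reverse=True)
    while pos < len(text):
        piece = next(p for p in pieces if text.startswith(p, pos))
        ids.append(VOCAB[piece])
        pos += len(piece)
    return ids


def decode(ids: list[int]) -> str:
    return "".join(ID_TO_PIECE[i] for i in ids)


def render(messages: list[tuple[str, str]]) -> str:
    """Toy chat template: each message is rendered on its own and concatenated."""
    return "".join(f"<{role}>{content}<end>" for role, content in messages)


def is_prefix_preserving(messages: list[tuple[str, str]]) -> bool:
    """Adding a message must only append text, never rewrite earlier turns."""
    return all(
        render(messages[: k + 1]).startswith(render(messages[:k]))
        for k in range(len(messages))
    )


# The policy sampled "a" then "b" as two separate tokens.
sampled = [VOCAB["a"], VOCAB["b"]]

# Round-tripping through text merges them into the single token "ab",
# so the trainer would score tokens the policy never produced.
reencoded = encode(decode(sampled))

# Token-level bookkeeping instead: keep the sampled ids untouched and
# tokenize only the text that is newly appended after the model's turn.
prompt_ids = encode("<user>b<end><bot>")
context_ids = prompt_ids + sampled + encode("<end><user>a<end><bot>")

print(sampled, reencoded)
print(is_prefix_preserving([("user", "b"), ("bot", "ab"), ("user", "a")]))
```

Both sequences decode to the same string, which is why the mismatch is easy to miss. Appending to the token sequence is only safe when the template never rewrites earlier turns, which is the property `is_prefix_preserving` checks in this toy setting.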

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

arXiv cs.CL

This paper addresses the 'Lost in Conversation' problem where LLMs struggle with information revealed across multiple turns. It proposes a scalable sharding pipeline to create multi-turn training data from single-turn QA datasets and uses reinforcement learning with verifiable rewards to train a memory-augmented policy that maintains a compact rolling memory, improving multi-turn reasoning accuracy and generalizing zero-shot to harder tasks.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Papers with Code Trending

UI-TARS-2 is a native GUI-centered agent model that addresses data scalability, multi-turn RL, and environment stability challenges, achieving state-of-the-art results on GUI benchmarks (88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld) and outperforming Claude and OpenAI agents.

@blc_16: MIT just released a new RL method called Pedagogical RL. The main lesson -> correct reasoning traces can still be bad t…

X AI KOLs Following

MIT introduces Pedagogical RL, a method that trains a teacher to produce trajectories that are learnable for a student by penalizing surprising steps, improving RL training efficiency.

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Hugging Face Daily Papers

OpenWebRL presents an open framework for training visual web agents using online multi-turn reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. Their 4B-parameter model outperforms prior open agents and competes with proprietary systems like OpenAI CUA and Gemini CUA.