@neural_avb: https://x.com/neural_avb/status/2063907440509571354
Summary
Explores a common failure mode in recursive language models (RLMs) where free-text subagent responses cause issues, and presents a solution using structured outputs to improve reliability, illustrated with a long-context question-answering example from NarrativeQA.
Cached at: 06/08/26, 03:27 PM
RLM Agents live healthier when they talk via Structured Outputs
We have one goal today: understand one of the common failure modes of Recursive Language Models, and a simple, cool way to reduce it.
Recursive Language Models (RLMs) let a model answer questions over a context far larger than its window by treating the prompt as a variable inside a Python REPL.
If you are unfamiliar with RLMs, I’d encourage you to read this article first. It covers how RLMs work and how they differ from other beloved techniques like ReAct, CodeAct, and vanilla subagents.
AVB (@neural_avb) · Mar 21 · Article: Recursive Language Models - what finally gave me the ‘aha’ moment. I have spent a decent chunk of last month implementing RLMs from scratch, and producing a 50 minute tutorial video on it. Throughout the process, I responded to 100+ questions on Youtube and X about…
The agent writes code to search, slice, and chunk the text, and can recursively spawn sub-agents over the pieces. The subagent answers come back inside the REPL values, i.e. the responses are never auto-dumped into the parent’s context directly.
**Spawning swarms of subagents to divide and conquer a task is super cool! But an RLM trace makes or breaks on how well the main agent handles three things:
(a) telling each subagent what it should do, (b) telling it what it should return, and (c) aggregating the subagent responses into a final response.**
As the big labs begin to train their models inside RLM harnesses, I am sure we will see language models naturally evolve to do these tasks inside a REPL in an efficient way. But currently, models work inside a REPL purely through in-context learning of RLM patterns and raw programming skill.
In other words, it’s the harness developer’s job to make life as easy as possible for language models inside an RLM.
We hit all those problems on one LongBench / NarrativeQA sample:
Context: the full ~107K-character text of Henry James’s **The Coxon Fund**. Basically, it’s a story book.
Question: “What is Saltram’s living situation?” (Saltram is a character in this story)
Correct answer: “He is a guest in the home of the Mulvilles.”
The truth is never said outright. Across the novel, the character Frank Saltram is the permanent house-guest (“inmate”) of the Mulvilles at their Wimbledon home.
Why is this a skill-check task for long-context models?
You only get this fact by reading and connecting passages, not by keyword lookup.
The literal token “Saltram” often appears nowhere near the passages that actually describe where he lives.
How an RLM could solve such long-context tasks
When the RLM is presented with a problem, one of its first goals is to decide whether to attack the problem head-on or deploy subagents.
**Attacking head-on** would mean the model slices and dices various sections of the input context itself, prints these sections in its own REPL, and figures out the answer.
Deploying subagents would mean it creates shorter slices of the original context, gets subagents to solve partial problems over them, and then aggregates their findings into one coherent response. Divide and conquer.
One way an RLM can solve the Saltram problem is to attack it head-on, inside a depth-0 REPL.
This reaches the correct answer and costs ~$0.04 with Minimax M3. The model searches and slices the context around key data points and just reads those relevant sections!
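As a rough illustration of what "search and slice" can look like inside a REPL, here is a minimal sketch. The helper name `keyword_windows`, the radius, and the toy context are all assumptions made for this example; they are not taken from the actual trace or from fast-rlm.

```python
import re


def keyword_windows(context: str, keyword: str, radius: int = 300, limit: int = 5) -> list[str]:
    """Return up to `limit` windows of +/- `radius` characters around keyword hits."""
    windows = []
    last_end = -1
    for match in re.finditer(re.escape(keyword), context):
        start = max(0, match.start() - radius)
        end = min(len(context), match.end() + radius)
        if start < last_end:
            continue  # skip hits whose window overlaps the previous one
        windows.append(context[start:end])
        last_end = end
        if len(windows) == limit:
            break
    return windows


# Toy stand-in for the long context variable held in the REPL (hypothetical text).
context = ("filler " * 100) + "Saltram stayed on with the Mulvilles at Wimbledon. " + ("filler " * 100)

for window in keyword_windows(context, "Mulvilles", radius=40):
    print(window)
```

The model only ever prints the small windows, so the full text never has to enter its context at once.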
But what if the RLM went for a subagent approach?
Subagents v1: Free-text fan-out (the failure)
In this first instance, the agent’s instinct was a textbook RLM move: chunk the context and map a sub-agent over each chunk, then reduce. This is one of the common patterns that almost every RLM system prompt has, so it’s not at all an invalid option.
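That chunk, map, reduce pattern can be sketched roughly as follows. This is not the code the agent wrote: `llm_query` is a hypothetical stub standing in for a real sub-agent call, and the chunk size and prompts are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks (the last one may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def llm_query(prompt: str) -> str:
    """Hypothetical stand-in for a free-text sub-agent call."""
    if "Mulvilles" in prompt:
        return "He seems to be staying with the Mulvilles."
    return "The passage does not describe this."


def fan_out_free_text(context: str, question: str, chunk_size: int) -> tuple[list[str], str]:
    """Map one free-text sub-agent over each chunk, then build the reduce prompt."""
    chunks = chunk_text(context, chunk_size)
    prompts = [f"{question}\n\nPassage:\n{chunk}" for chunk in chunks]
    with ThreadPoolExecutor(max_workers=8) as pool:
        answers = list(pool.map(llm_query, prompts))  # order matches chunk order
    notes = "\n".join(f"sub{i}: {answer}" for i, answer in enumerate(answers))
    reduce_prompt = f"Using these notes, answer: {question}\n{notes}"
    return answers, reduce_prompt


toy_context = "a" * 250 + "Mulvilles" + "b" * 250
answers, reduce_prompt = fan_out_free_text(toy_context, "Where does he live?", 100)
print(reduce_prompt)
```

Note that the reduce step receives one line of prose per chunk, so its input grows with the number of subagents.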
It split the ~107K characters into fixed-size chunks and fanned out free-text sub-agents in parallel, then asked one more to aggregate the responses! Very cool, but there is an issue you will soon see:
Then it delegated a second LM to resolve the answer:
Pretty simple technique. Spawn a swarm (of 62 subagents) to analyze multiple regions of the context and then get another agent to summarize it. But something bad happened:
Here are some of the responses it got back from those 62 subagents:
- sub2: “The passage does NOT contain direct verbatim quotes describing Saltram’s living situation.”
- sub4: “The text does not contain any sentences describing where Saltram lives. Mrs. Saltram is mentioned only in passing…”
- sub5: “The provided text excerpt does not contain any passages about Saltram’s home, her ‘set of chambers’, or where she lives.”
- sub0, sub1, sub3: …
The result?
The second LM (the aggregator) got confused because it had 62 responses to classify, and all of them were free text. I ran a max-depth=1 RLM, so at the second level, the subagents aren’t allowed to spawn new agents.
What followed was a long flail: the root couldn’t cleanly read the sub-agent’s return (it kept printing), and eventually it just hand-wrote an answer:
It did a bunch of things but none of it worked. This is not the correct answer. It got overwhelmed with the 62 subagents’ short answers.
If only instead of free-form text, subagents were forced to return a structured response.
Subagents v2: Structured outputs routing (success)
The second run did fan out, but used structured output to do it cleanly.
Instead of asking sub-agents for prose (“describe his living situation”), it asked each one a True/False question with a JSON-Schema-constrained answer, then read the chunks that said True.
This may look very similar, but if you squint, there is one MAJOR MAJOR thing that happened here:
In the fast-rlm library, the structured I/O works in a simple way. When a subagent (or main agent) receives a schema request, we validate that the response it is sending back to the main agent perfectly adheres to the schema. No exceptions.
AVB (@neural_avb) · May 20: New version of fast-rlm out today (v1.14)
New features in this release:
- Input to RLM need not be string, can be any python dictionary
- Output schema declaration -> RLM is guaranteed to return output in your designed structured output
- Agents can call subagent with explicit…
You can also use more complex schemas containing nested lists and objects (anything you can define in zod or pydantic), and the library validates that the schema is always satisfied.
So now, instead of the model trying to parse 40 variations of “No mention of Saltram’s living condition in this paragraph”, it can just look directly at one boolean flag and make everything stick!
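A minimal sketch of this boolean routing idea is shown below. `ask_bool` is a hypothetical stub standing in for a schema-constrained sub-agent (schema `{"type": "boolean"}`); it is not the fast-rlm API, and the toy keyword heuristic inside it only exists so the example runs.

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks (the last one may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ask_bool(chunk: str, question: str) -> bool:
    """Hypothetical sub-agent whose answer is constrained to a single boolean."""
    return "Mulvilles" in chunk  # toy heuristic standing in for a real model call


def route_relevant_chunks(context: str, question: str, chunk_size: int) -> list[str]:
    """Ask every chunk a True/False question and keep only the chunks flagged True."""
    chunks = chunk_text(context, chunk_size)
    flags = [ask_bool(chunk, question) for chunk in chunks]
    for flag in flags:
        if not isinstance(flag, bool):  # the contract: nothing but a boolean comes back
            raise TypeError(f"(root): must be boolean, got {type(flag).__name__}")
    return [chunk for chunk, flag in zip(chunks, flags) if flag]


toy_context = "a" * 250 + "Mulvilles" + "b" * 250
question = "Does this passage describe where Saltram lives?"
relevant = route_relevant_chunks(toy_context, question, 100)
print(len(relevant), "chunk(s) to read in full")
```

The aggregation step is now a plain list filter in the REPL, and only the flagged chunks need to be read by a model.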
Lower chances of hallucination, because the model never had to read too much of the story into its context at once!
Verdict: correct!
In the first no-subagent depth-0 approach and this depth-1 approach, we used roughly the same number of tokens. In fact, both cost ~$0.04 with Minimax M3.
But the scope for hallucination is much reduced in the subagent approach, since you are not looking at large bodies of unrelated, confounding text! Low-powered reasoning models are totally capable of losing the plot when they read too many tokens all at once.
The booleans acted as a direct attention mask to the original context! External sparsification of the input prompt!
Note: the boolean schema is just the one the RLM picked here. In theory, an agent can request any schema. Every response is validated against it before being passed back!
Validating structured output inside RLMs
To wrap up, I’ll mention how the structured output stuff is implemented inside RLMs.
Structured Out mode is not just for main agents to call their subagents! The user can also enforce this contract with the root agent.
1. Schema normalization (Python)
You/agents can pass the desired output schema as a Pydantic model, a primitive type like int, a list[Model] generic, or a raw JSON Schema dict. We convert all of these to a plain JSON Schema (model_json_schema() for Pydantic, a TypeAdapter for generics, a lookup for primitives).
2. The agent is shown the contract at REPL startup
At step 0, before any of the agent’s work begins, we display the desired JSON schema to the agent. So the model knows the exact shape it must return before it writes any code.
3. Validate on every FINAL
When an LLM calls FINAL(answer) inside the REPL, we take the content of answer and return it from the subagent back to its calling agent. But juuuuust before we return the answer, we do a schema validation check!
If the validation passes, we return the result to the main agent as expected. But if it fails, we send feedback to the current agent listing the exact validation errors and the expected format it must satisfy.
4. Retry, don’t restart
On failure, the agent receives the schema and the specific errors (e.g. (root): must be boolean). The REPL work is untouched, so the model just needs to fix the value and re-call FINAL.
We validate again and approve if schema matches!
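The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the fast-rlm implementation: the function names (`to_json_schema`, `contract_banner`, `validate`, `final_with_retries`) are hypothetical, the validator covers only a small subset of JSON Schema, list generics are handled by hand instead of a Pydantic TypeAdapter, and model classes are detected by duck-typing on `model_json_schema`.

```python
import json
from typing import Any, get_args, get_origin

PRIMITIVES = {bool: "boolean", int: "integer", float: "number", str: "string"}


def to_json_schema(spec: Any) -> dict:
    """Step 1: normalize a raw dict, primitive, list[T], or model class to JSON Schema."""
    if isinstance(spec, dict):
        return spec  # already a raw JSON Schema dict
    if spec in PRIMITIVES:
        return {"type": PRIMITIVES[spec]}
    if get_origin(spec) is list:
        (item,) = get_args(spec)
        return {"type": "array", "items": to_json_schema(item)}
    if hasattr(spec, "model_json_schema"):
        return spec.model_json_schema()  # Pydantic v2 model classes expose this
    raise TypeError(f"unsupported schema spec: {spec!r}")


def contract_banner(schema: dict) -> str:
    """Step 2: the text shown to the agent before it writes any code."""
    return "Your FINAL answer must satisfy this JSON Schema:\n" + json.dumps(schema, indent=2)


TYPE_CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def validate(value: Any, schema: dict, path: str = "(root)") -> list[str]:
    """Step 3: return a list of error strings (empty means the value is valid)."""
    kind = schema.get("type")
    if kind in TYPE_CHECKS and not TYPE_CHECKS[kind](value):
        return [f"{path}: must be {kind}"]
    errors = []
    if kind == "array" and "items" in schema:
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    if kind == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required property '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    return errors


def final_with_retries(agent_step, spec: Any, max_retries: int = 3) -> Any:
    """Step 4: on failure, feed the errors back and let the agent call FINAL again."""
    schema = to_json_schema(spec)
    feedback = None
    for _ in range(max_retries + 1):
        answer = agent_step(feedback)  # hypothetical: the value the agent passes to FINAL
        errors = validate(answer, schema)
        if not errors:
            return answer
        feedback = {"errors": errors, "expected_schema": schema}
    raise ValueError(f"schema still violated after {max_retries} retries: {errors}")


def toy_agent(feedback):
    """Toy agent: returns the string "True" first, then fixes it after feedback."""
    return "True" if feedback is None else True


print(contract_banner(to_json_schema(bool)))
print(final_with_retries(toy_agent, bool))
```

The retry loop only re-runs the final step, which mirrors the point that the REPL state does not need to be rebuilt when the shape of the answer is wrong.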
With just this simple validation mechanism, RLMs unlock a whole new way to operate: passing exact contract requirements to subagents, whose validated responses can be used directly inside the REPL.
Check out the fast-rlm repo here: https://github.com/avbiswas/fast-rlm