A Verifiable Search Is Not a Learnable Chain-of-Thought

Hugging Face Daily Papers 06/20/26, 12:00 AM Papers

chain-of-thought reasoning search fine-tuning distillation verification lora

Summary

This paper demonstrates that training models on chain-of-thought demonstrations fails for tasks requiring backtracking search, showing that search procedures cannot be faithfully imitated. The authors find that even when models perform well on sub-components, they cannot carry forward a left-to-right derivation for cryptarithmetic tasks.

It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a rank-<= 32 LoRA over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling its backtracking search holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.

Original Article

View Cached Full Text

Cached at: 06/23/26, 05:43 PM

Paper page - A Verifiable Search Is Not a Learnable Chain-of-Thought

Source: https://huggingface.co/papers/2606.21884

Abstract

Training models on chain-of-thought demonstrations fails for tasks requiring backtracking search because the forward derivation cannot be faithfully imitated, demonstrating a fundamental limitation in learning search procedures through demonstration.

It is tempting to assume any task solvable by a short program can be taught to a model as itschain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them aschain-of-thought, and distill into a rank-<= 32LoRAover a 30B (3.5B-active)Nemotronmodel.Forward-computabletasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling itsbacktracking searchholds at 0.01-0.07 across elevenchain-of-thoughtdesigns, RL fromverifiable rewards, andself-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation.Fine-tuninglearns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time (“verdict-as-token”). The ceiling holds across backbones from 3B to 671B and acrossfine-tuningand prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure’s only solution is search over information-free structure, no faithful forwardchain-of-thoughtexists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plusverification; the 1st-place solution reaches Private LB 0.92 this way. What distills ismemorizationandverification, not search.

View arXiv page View PDF Project page GitHub1 Add to collection

Get this paper in your agent:

hf papers read 2606\.21884

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.21884 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.21884 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.21884 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

A Verifiable Search Is Not a Learnable Chain-of-Thought

Paper page - A Verifiable Search Is Not a Learnable Chain-of-Thought

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

Reasoning models struggle to control their chains of thought, and that’s good

Training on Documents About Monitoring Leads to CoT Obfuscation

Reasoning Models Don't Just Think Longer, They Move Differently

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces

Submit Feedback

Similar Articles

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

Reasoning models struggle to control their chains of thought, and that’s good

Training on Documents About Monitoring Leads to CoT Obfuscation

Reasoning Models Don't Just Think Longer, They Move Differently

The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces