A Verifiable Search Is Not a Learnable Chain-of-Thought

Hugging Face Daily Papers Papers

Summary

This paper demonstrates that training models on chain-of-thought demonstrations fails for tasks requiring backtracking search, showing that search procedures cannot be faithfully imitated. The authors find that even when models perform well on sub-components, they cannot carry forward a left-to-right derivation for cryptarithmetic tasks.

It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a rank-<= 32 LoRA over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling its backtracking search holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.
Original Article
View Cached Full Text

Cached at: 06/23/26, 05:43 PM

Paper page - A Verifiable Search Is Not a Learnable Chain-of-Thought

Source: https://huggingface.co/papers/2606.21884

Abstract

Training models on chain-of-thought demonstrations fails for tasks requiring backtracking search because the forward derivation cannot be faithfully imitated, demonstrating a fundamental limitation in learning search procedures through demonstration.

It is tempting to assume any task solvable by a short program can be taught to a model as itschain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them aschain-of-thought, and distill into a rank-<= 32LoRAover a 30B (3.5B-active)Nemotronmodel.Forward-computabletasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling itsbacktracking searchholds at 0.01-0.07 across elevenchain-of-thoughtdesigns, RL fromverifiable rewards, andself-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation.Fine-tuninglearns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time (“verdict-as-token”). The ceiling holds across backbones from 3B to 671B and acrossfine-tuningand prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure’s only solution is search over information-free structure, no faithful forwardchain-of-thoughtexists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plusverification; the 1st-place solution reaches Private LB 0.92 this way. What distills ismemorizationandverification, not search.

View arXiv pageView PDFProject pageGitHub1Add to collection

Get this paper in your agent:

hf papers read 2606\.21884

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.21884 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.21884 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.21884 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

arXiv cs.CL

This research paper from MediaTek and National Taiwan University challenges the assumption that reasoning chains must be dense and sequential, showing that models can extract answers from sparse, shuffled, and noisy reasoning traces. The findings suggest that answer extraction is robust and order-independent, potentially enabling more efficient, parallelized reasoning generation.

Reasoning models struggle to control their chains of thought, and that’s good

OpenAI Blog

OpenAI researchers study whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring, finding that current models struggle to control their reasoning even when aware of monitoring. They introduce CoT-Control, an open-source evaluation suite with over 13,000 tasks to measure chain-of-thought controllability in reasoning models.

Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL

This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, they find that reasoning-trained models exhibit distinct trajectory geometry—most clearly in code—indicating reasoning training changes how computation unfolds, not just how much is used.