Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Hugging Face Daily Papers 06/01/26, 12:00 AM Papers

Summary

Proposes Chunk-Level Guided Generation, a training-free method using off-the-shelf LLMs as process scorers to select fixed-length candidate chunks during small model generation, significantly improving mathematical reasoning accuracy over majority voting and PRM guided search.

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but it fails when the small model has already committed to incorrect reasoning paths. Process reward model (PRM) guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4 to 6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
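The two selection rules can be written out as scores over a candidate chunk. This is only a sketch of one plausible reading of the abstract: the symbols x (the prompt plus all committed chunks), c (a candidate chunk), and the subscripts "large" and "small" are notation introduced here, and the paper may weight or normalize the small-model term differently.

```latex
\[
s_{\mathrm{LGS}}(c) = \frac{1}{|c|} \sum_{t=1}^{|c|} \log p_{\mathrm{large}}\bigl(c_t \mid x, c_{<t}\bigr)
\]
\[
s_{\mathrm{CGS}}(c) = \frac{1}{|c|} \sum_{t=1}^{|c|}
  \Bigl[ \log p_{\mathrm{large}}\bigl(c_t \mid x, c_{<t}\bigr)
       - \log p_{\mathrm{small}}\bigl(c_t \mid x, c_{<t}\bigr) \Bigr]
\]
```

Under this reading, when every candidate has the same length, the factor 1/|c| is identical across candidates and does not change which chunk ranks highest.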

Original Article

Cached at: 06/02/26, 03:37 PM

Source: https://huggingface.co/papers/2606.01682

Abstract

Chunk-Level Guided Generation uses a large language model as a process scorer to select fixed-length candidate chunks during small model generation, improving reasoning accuracy over traditional methods like majority voting and PRM guided search.

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model’s log-probability to favor chunks where the large model’s preference diverges from the small model’s. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4 to 6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
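The sample, score, commit loop described in the abstract might look roughly like the sketch below. It is not the authors' implementation. The callables `sample_chunks`, `large_logprobs`, and `small_logprobs` are hypothetical stand-ins for real model calls, and the names `guided_generate`, `chunk_len`, `max_chunks`, and `eos_id`, along with the exact form of the CGS score, are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Tokens = List[int]
# Hypothetical interfaces (assumptions, not taken from the paper):
Sampler = Callable[[Tokens, int, int], List[Tokens]]  # (prefix, k, chunk_len) -> k chunks
Scorer = Callable[[Tokens, Tokens], List[float]]      # (prefix, chunk) -> per-token log-probs


def lgs_score(large_lp: Sequence[float]) -> float:
    """Length-normalized large-model log-probability of one chunk."""
    return sum(large_lp) / len(large_lp)


def cgs_score(large_lp: Sequence[float], small_lp: Sequence[float]) -> float:
    """Large-model minus small-model log-probability, length-normalized (assumed form)."""
    if len(large_lp) != len(small_lp):
        raise ValueError("both models must score the same tokens")
    return (sum(large_lp) - sum(small_lp)) / len(large_lp)


def guided_generate(
    prompt: Tokens,
    sample_chunks: Sampler,
    large_logprobs: Scorer,
    small_logprobs: Scorer,
    k: int = 4,
    chunk_len: int = 8,
    max_chunks: int = 16,
    eos_id: int = 0,
    rule: str = "cgs",
) -> Tokens:
    """Sample k chunks per step, score them, and commit the best one."""
    prefix = list(prompt)
    for _ in range(max_chunks):
        candidates = sample_chunks(prefix, k, chunk_len)
        scores = []
        for chunk in candidates:
            # The large model only scores tokens; it never generates text.
            large_lp = large_logprobs(prefix, chunk)
            if rule == "lgs":
                scores.append(lgs_score(large_lp))
            else:
                scores.append(cgs_score(large_lp, small_logprobs(prefix, chunk)))
        best = candidates[max(range(len(candidates)), key=scores.__getitem__)]
        if eos_id in best:
            # Keep tokens up to and including end-of-sequence, then stop.
            prefix.extend(best[: best.index(eos_id) + 1])
            break
        # Commit the chosen chunk before sampling the next step.
        prefix.extend(best)
    return prefix
```

In practice the small-model log-probabilities are usually available from sampling itself, so a real implementation would likely reuse them rather than call a second scorer; the separate callable here just keeps the sketch simple.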

Similar Articles

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

Geometric Latent Reasoning Induces Shorter Generations in LLMs

@stevibe: Which LLMs actually love to think? Tested 7 models on 5 math problems, measured reasoning length. The think winners: bo…

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

Unsupervised Process Reward Models
