Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Hugging Face Daily Papers Papers

Summary

Proposes Chunk-Level Guided Generation, a training-free method using off-the-shelf LLMs as process scorers to select fixed-length candidate chunks during small model generation, significantly improving mathematical reasoning accuracy over majority voting and PRM guided search.

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
Original Article
View Cached Full Text

Cached at: 06/02/26, 03:37 PM

Paper page - Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Source: https://huggingface.co/papers/2606.01682

Abstract

Chunk-Level Guided Generation uses a large language model as a process scorer to select fixed-length candidate chunks during small model generation, improving reasoning accuracy over traditional methods like majority voting and PRM guided search.

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths.PRM guided searchavoids this by scoring candidate continuations during generation, but requires areward modeltrained with step-level labels. We proposeChunk-Level Guided Generation, a training-free alternative that uses an off-the-shelflarge language modelas aprocess scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules:Likelihood-Guided Selection(LGS), which selects the chunk with the highest length-normalized large-model log-probability, andContrastive-Guided Selection(CGS), which subtracts the small model’s log-probability to favor chunks where the large model’s preference diverges from the small model’s. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperformsmajority votingby up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72Bguided searchon most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassingmajority votingby 4--6 pp. Finally,Chunk-Level Guided Generationproduces substantially shorterreasoning traces thanPRM guided search.

View arXiv pageView PDFProject pageAdd to collection

Get this paper in your agent:

hf papers read 2606\.01682

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.01682 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.01682 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.01682 in a Space README.md to link it from this page.

Collections including this paper1

Similar Articles

Unsupervised Process Reward Models

Hugging Face Daily Papers

This paper proposes unsupervised Process Reward Models (uPRM) that eliminate the need for human annotations by using LLM next-token probabilities to identify erroneous reasoning steps, achieving up to 15% accuracy improvements over LLM-as-a-Judge and performing comparably to supervised PRMs as verifiers and reward signals.