Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
Summary
Proposes Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf LLM as a process scorer to select among fixed-length candidate chunks during small-model generation. On mathematical reasoning benchmarks it outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms PRM guided search on most benchmarks.
Cached at: 06/02/26, 03:37 PM
Source: https://huggingface.co/papers/2606.01682
Abstract
Chunk-Level Guided Generation uses a large language model as a process scorer to select fixed-length candidate chunks during small model generation, improving reasoning accuracy over traditional methods like majority voting and PRM guided search.
Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model’s log-probability to favor chunks where the large model’s preference diverges from the small model’s. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4 to 6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
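The two selection rules in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy log-probabilities, and the exact form of the CGS score (a plain, unweighted difference of length-normalized log-probabilities) are assumptions made for illustration. The abstract only states that CGS subtracts the small model's log-probability. The sketch assumes the per-token log-probabilities of each candidate chunk under both models have already been computed.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def select_lgs(large_lps: Sequence[Sequence[float]]) -> int:
    """LGS: pick the chunk the large model finds most likely."""
    return argmax([mean_logprob(lp) for lp in large_lps])


def select_cgs(
    large_lps: Sequence[Sequence[float]],
    small_lps: Sequence[Sequence[float]],
) -> int:
    """CGS (assumed form): large-model score minus small-model score."""
    scores = [
        mean_logprob(large) - mean_logprob(small)
        for large, small in zip(large_lps, small_lps)
    ]
    return argmax(scores)


# Toy data: k = 3 candidate chunks, each 2 tokens long (fixed length).
# Values are made up for illustration, not taken from the paper.
large: List[List[float]] = [[-0.2, -0.4], [-0.1, -0.2], [-0.6, -0.8]]
small: List[List[float]] = [[-1.0, -1.2], [-0.1, -0.1], [-0.5, -0.5]]

lgs_choice = select_lgs(large)         # chunk 1: highest large-model mean
cgs_choice = select_cgs(large, small)  # chunk 0: large model likes it far more than the small model does
```

In this toy case the two rules disagree: LGS picks the chunk both models already find likely, while CGS picks the chunk where the large model's preference most exceeds the small model's. Because every candidate has the same length, dividing by the chunk length does not change which candidate wins. In a full generation loop, the chosen chunk would be appended to the prefix before the small model samples the next k candidates.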
Similar Articles
Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
Geometric Latent Reasoning Induces Shorter Generations in LLMs
Geometric Latent Reasoning (GLR) introduces a geometric path-approximation method for latent reasoning in LLMs, enabling shorter generations while maintaining accuracy across mathematical reasoning benchmarks.
@stevibe: Which LLMs actually love to think? Tested 7 models on 5 math problems, measured reasoning length. The think winners: bo…
Benchmarked 7 LLMs on 5 math problems; Qwen3.5 27B and 35B A3B generated the longest reasoning chains, exceeding 10k tokens per question.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
This paper introduces a mutual reasoning technique that enhances the problem-solving capabilities of smaller LLMs by iteratively refining candidate solutions through self-feedback and reward functions.
Unsupervised Process Reward Models
This paper proposes unsupervised Process Reward Models (uPRM) that eliminate the need for human annotations by using LLM next-token probabilities to identify erroneous reasoning steps, achieving up to 15% accuracy improvements over LLM-as-a-Judge and performing comparably to supervised PRMs as verifiers and reward signals.