Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Hugging Face Daily Papers 06/02/26, 03:42 AM Papers

Summary

This paper formulates adaptive sampling for large language models as a Markov decision process and trains a lightweight RL controller to balance correctness, latency, and computational cost, achieving improved trade-offs.

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. The method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
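The stop-or-continue loop described above can be illustrated with a minimal sketch. The abstract does not specify the state features, batch size, policy architecture, or reward weights, so everything named here (`answer_stats`, `threshold_policy`, `adaptive_sample`, `episode_reward`, and all default values) is a hypothetical assumption. In particular, a fixed threshold rule stands in for the trained RL controller; it is not the paper's policy.

```python
from collections import Counter
from typing import Callable, List, Tuple

# State seen by the controller: statistics of final answers only (assumed features).
State = Tuple[float, float, int]  # (top vote share, margin over runner-up, sample count)


def answer_stats(answers: List[str]) -> State:
    """Summarize the sampled final answers without looking at model internals."""
    counts = Counter(answers).most_common(2)
    n = len(answers)
    top = counts[0][1] / n
    second = counts[1][1] / n if len(counts) > 1 else 0.0
    return top, top - second, n


def threshold_policy(state: State) -> bool:
    """Hypothetical stand-in for the learned controller. Returns True to stop."""
    _top, margin, n = state
    return n >= 4 and margin >= 0.5


def adaptive_sample(
    sample_fn: Callable[[], str],
    policy: Callable[[State], bool] = threshold_policy,
    batch_size: int = 4,
    max_rounds: int = 5,
) -> Tuple[str, int, int]:
    """Sample in rounds; after each round the policy chooses stop or continue.

    Returns (majority answer, rounds used, total samples drawn).
    """
    answers: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        # One round = one batch of samples (a proxy for one unit of latency).
        answers.extend(sample_fn() for _ in range(batch_size))
        rounds += 1
        if policy(answer_stats(answers)):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def episode_reward(
    correct: bool,
    rounds: int,
    n_samples: int,
    lam_round: float = 0.05,   # assumed latency penalty per round
    lam_sample: float = 0.01,  # assumed compute penalty per sample
) -> float:
    """Assumed reward shape: correctness minus latency and compute penalties."""
    return float(correct) - lam_round * rounds - lam_sample * n_samples
```

In this sketch, `sample_fn` would wrap a single LLM call that returns a parsed final answer. Because the state is only a handful of vote statistics, a policy over it is small enough to run on CPU, which is consistent with the deployment claim above.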

Original Article

Cached at: 06/03/26, 07:36 AM

Source: https://huggingface.co/papers/2606.03102

Abstract

Adaptive sampling for large language models is formulated as a Markov decision process and optimized using reinforcement learning to balance correctness, latency, and computational cost.

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. The method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
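The Lagrangian-relaxation reading can be made concrete with a generic worked form. The abstract does not give the paper's notation, so the symbols below are assumptions: $\pi$ is the controller policy, $C \in \{0,1\}$ indicates a correct final answer, $T$ is the number of sampling rounds, $N$ is the total number of samples, and $B_T$, $B_N$ are hypothetical budgets on expected rounds and expected samples.

```latex
% Constrained problem: maximize accuracy under explicit budgets (assumed form).
\max_{\pi} \; \mathbb{E}_{\pi}[C]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}[T] \le B_T, \qquad \mathbb{E}_{\pi}[N] \le B_N .

% Lagrangian with multipliers \lambda_T, \lambda_N \ge 0:
\mathcal{L}(\pi, \lambda_T, \lambda_N)
= \mathbb{E}_{\pi}[C]
- \lambda_T \bigl( \mathbb{E}_{\pi}[T] - B_T \bigr)
- \lambda_N \bigl( \mathbb{E}_{\pi}[N] - B_N \bigr).

% For fixed multipliers the budget terms are constants, so
\arg\max_{\pi} \mathcal{L}(\pi, \lambda_T, \lambda_N)
= \arg\max_{\pi} \; \mathbb{E}_{\pi}\bigl[ C - \lambda_T T - \lambda_N N \bigr].
```

Under these assumptions, maximizing the Lagrangian for fixed multipliers is an unconstrained MDP whose reward is correctness minus per-round and per-sample penalties. The penalty weights then act as prices on latency and compute, and raising a weight corresponds to tightening the matching budget.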



Similar Articles

LEEPS: Latent-Guided Explore-Exploit Prompt Sampling for Efficient RLVR in Large Language Models

Towards Robust Reinforcement Learning for Small-Scale Language Model Agents

Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling

Towards Scalable Multi-Task Reinforcement Learning with Large Decision Models
