Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
Summary
This paper formulates adaptive sampling for large language models as a Markov decision process and trains a lightweight RL controller to balance correctness, latency, and computational cost. Against baselines such as ASC and ESC, it reports improved trade-offs among answer correctness, sampling rounds, and total samples.
Cached at: 06/03/26, 07:36 AM
Source: https://huggingface.co/papers/2606.03102
Abstract
Adaptive sampling for large language models is formulated as a Markov decision process and optimized using reinforcement learning to balance correctness, latency, and computational cost.
Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Our method is lightweight: it relies only on statistics of final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
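The stop-or-continue loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the choice of answer statistics, the batch size, and the penalty weights are all assumptions, and a fixed agreement threshold stands in for the RL-trained controller. The `penalized_reward` function shows one plausible reading of the Lagrangian interpretation, in which correctness is traded against per-round and per-sample costs; the paper's actual reward is not given on this page.

```python
from collections import Counter
from typing import Callable, List, Tuple

Stats = Tuple[float, int, int]


def answer_stats(answers: List[str]) -> Stats:
    """Summarize final answers: (top-answer share, distinct answers, sample count)."""
    counts = Counter(answers)
    n = len(answers)
    top_share = counts.most_common(1)[0][1] / n
    return top_share, len(counts), n


def threshold_policy(stats: Stats, stop_share: float = 0.8) -> str:
    """Hypothetical stand-in for the learned controller.

    Stops once the most frequent answer holds at least `stop_share` of samples.
    """
    top_share, _, _ = stats
    return "stop" if top_share >= stop_share else "continue"


def adaptive_sample(
    draw: Callable[[int], List[str]],
    policy: Callable[[Stats], str],
    batch_size: int = 4,
    max_rounds: int = 5,
) -> Tuple[str, int, int]:
    """Draw batches of answers until the policy stops or the round budget runs out.

    Returns (majority answer, rounds used, total samples drawn).
    """
    answers: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        answers.extend(draw(batch_size))
        rounds += 1
        if policy(answer_stats(answers)) == "stop":
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def penalized_reward(
    correct: bool,
    rounds: int,
    samples: int,
    lam_rounds: float = 0.05,
    lam_samples: float = 0.01,
) -> float:
    """Assumed reward shape: correctness minus weighted latency and compute costs."""
    return float(correct) - lam_rounds * rounds - lam_samples * samples


if __name__ == "__main__":
    # Toy sampler that always agrees, standing in for calls to an LLM.
    answer, rounds, samples = adaptive_sample(lambda k: ["42"] * k, threshold_policy)
    print(answer, rounds, samples, penalized_reward(answer == "42", rounds, samples))
```

In this sketch the policy sees only statistics of final answers, which is what would let a controller of this kind run on CPU. Swapping `threshold_policy` for a trained policy would leave the rest of the loop unchanged.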
Similar Articles
CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment
This paper introduces CASCADE, a framework for deployment-time learning that allows Large Language Models to adapt continuously through episodic memory and contextual bandit optimization without modifying model parameters.
Reinforcing Recursive Language Models (18 minute read)
The article explores reinforcement learning fine-tuning of small (4B) recursive language models (RLMs) to perform evidence selection from scientific documents, showing that RL-trained 4B models match Claude Sonnet 4.6 performance at a fraction of the size and cost.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
This paper introduces ReAD, a reinforcement-guided capability distillation framework that optimizes token budgets by accounting for cross-capability transfer in large language models. It demonstrates improved downstream utility and reduced harmful spillover compared to existing baselines.
Discovering Reinforcement Learning Interfaces with Large Language Models
This paper introduces LIMEN, an LLM-guided evolutionary framework that automatically discovers reinforcement learning interfaces by jointly optimizing observation mappings and reward functions from raw simulator states. The approach reduces manual engineering effort and demonstrates that co-designing observations and rewards outperforms optimizing either component alone.
UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling
Proposes UniScale, an online framework that unifies model routing and test-time scaling via contextual bandit optimization for better quality-cost trade-offs in LLM inference.