LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

test-time-computing auto-tts reasoning controller-synthesis open-source llm-optimization

Summary

This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
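To make the idea of evaluating a controller "without repeated LLM calls" concrete, here is a minimal Python sketch of replaying a width-depth controller over cached trajectories. It is an illustration under stated assumptions, not the AutoTTS implementation: the `Step` layout, the `replay_controller` and `evaluate` names, and the `width` and `stop_votes` knobs are all hypothetical, the stop rule (majority agreement among probed answers) is a stand-in for whatever the discovered controllers do, and pruning is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Step:
    """One cached chunk of a reasoning chain (hypothetical layout)."""
    tokens: int                  # tokens this chunk cost when it was generated
    probe_answer: Optional[str]  # cached answer if probed here, None if no answer yet


Trajectory = List[Step]


def majority(answers: Dict[int, Optional[str]]) -> Tuple[Optional[str], int]:
    """Return the most common non-empty answer and its vote count."""
    counts = Counter(a for a in answers.values() if a is not None)
    return counts.most_common(1)[0] if counts else (None, 0)


def replay_controller(trajs: List[Trajectory], width: int,
                      stop_votes: int) -> Tuple[Optional[str], int]:
    """Replay a simple controller over cached trajectories; no LLM is called."""
    active = list(range(min(width, len(trajs))))  # branch: open up to `width` chains
    answers: Dict[int, Optional[str]] = {}
    cost, depth = 0, 0
    while active:
        still_active = []
        for i in active:
            step = trajs[i][depth]           # continue: advance one cached chunk
            cost += step.tokens              # charge the tokens that chunk used
            answers[i] = step.probe_answer   # probe: read the cached signal
            if depth + 1 < len(trajs[i]):
                still_active.append(i)
        answer, votes = majority(answers)
        if votes >= stop_votes:              # stop: enough branches agree
            return answer, cost
        active, depth = still_active, depth + 1
    return majority(answers)[0], cost        # all chains exhausted


def evaluate(problems: List[Tuple[List[Trajectory], str]], width: int,
             stop_votes: int) -> Tuple[float, float]:
    """Accuracy and mean token cost of one controller setting on cached data."""
    results = [(replay_controller(t, width, stop_votes), gold) for t, gold in problems]
    accuracy = sum(ans == gold for (ans, _), gold in results) / len(results)
    mean_cost = sum(c for (_, c), _ in results) / len(results)
    return accuracy, mean_cost


# Toy cached data for a single problem whose reference answer is "42".
demo_trajs = [
    [Step(100, None), Step(100, "42"), Step(100, "42")],
    [Step(100, None), Step(100, "42")],
    [Step(100, "7"), Step(100, "7"), Step(100, "41")],
]

if __name__ == "__main__":
    for width, stop_votes in [(1, 1), (3, 2), (3, 4)]:
        print(width, stop_votes, evaluate([(demo_trajs, "42")], width, stop_votes))
```

Because every decision only indexes into stored steps, sweeping many controller settings costs almost nothing compared with regenerating reasoning chains, which is the property the abstract attributes to the discovery environment.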

Original Article

Cached at: 05/11/26, 02:42 AM

Source: https://huggingface.co/papers/2605.08083 Published on May 8

#2 Paper of the day

Abstract

AutoTTS automates test-time scaling strategy discovery by formulating it as controller synthesis over reasoning trajectories and probe signals, achieving improved accuracy-cost tradeoffs with minimal computational overhead.

Get this paper in your agent:

hf papers read 2605.08083

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Agentic Test-Time Scaling (GitHub Repo)

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

@ihtesham2005: If you still think AI agents can't do real research, this paper will end that argument. Researchers from Google and Met…

Researchers let AI Agents Optimize LLM Reasoning and Cut Tokens by 70%
