Learning to Refine Hidden States for Reliable LLM Reasoning
Summary
Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.
Cached at: 06/17/26, 05:40 AM
# Learning to Refine Hidden States for Reliable LLM Reasoning
Source: [https://arxiv.org/html/2606.17524](https://arxiv.org/html/2606.17524)
###### Abstract
Large language models show strong reasoning ability, but their internal reasoning process can remain unstable in complex multi-step settings, where early hidden-state errors may propagate to incorrect predictions. We propose ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations before decoding. ReLAR maintains a compact latent reasoning state and uses learned depth and action controllers to adaptively determine both the number and direction of refinement steps. The controllers are trained with a policy-gradient objective based on step-wise likelihood improvement, enabling efficient input-dependent reasoning without explicit chain-of-thought generation. Experiments on medical, mathematical, multi-hop reasoning, and open-ended generation benchmarks show that ReLAR improves accuracy, generation quality, and reasoning stability with substantially lower inference overhead than explicit reasoning baselines. Code is available at [tongyu0924/Learning-to-Refine-Hidden-States](https://github.com/tongyu0924/Learning-to-Refine-Hidden-States-for-Reliable-LLM-Reasoning).
## 1 Introduction
Large language models (LLMs) have demonstrated strong capabilities in medical question answering, clinical summarization, and diagnostic reasoning (Singhal et al., [2023](https://arxiv.org/html/2606.17524#bib.bib1); Thirunavukarasu et al., [2023](https://arxiv.org/html/2606.17524#bib.bib2); Lucas et al., [2024](https://arxiv.org/html/2606.17524#bib.bib23)), highlighting their potential as core components of future clinical decision-support systems.
However, reliable reasoning remains substantially more challenging in complex, multi-step settings. Inputs may be incomplete, heterogeneous, or internally conflicting, and even minor logical inconsistencies can propagate across reasoning steps and lead to incorrect conclusions in high-stakes environments (Chen et al., [2025](https://arxiv.org/html/2606.17524#bib.bib24); He et al., [2025](https://arxiv.org/html/2606.17524#bib.bib25)).
In such settings, the problem is often not lack of knowledge, but instability in how the model internally integrates evidence across multiple reasoning steps\. A model may over\-anchor on one salient signal, underweight other relevant information, and drift toward an incorrect conclusion\. Ensuring stable and controllable multi\-step reasoning is therefore critical for reliable deployment\.
A predominant approach for eliciting reasoning in LLMs is explicit reasoning, such as chain-of-thought (CoT) prompting, which encourages models to generate intermediate reasoning steps in natural language (Wei et al., [2022](https://arxiv.org/html/2606.17524#bib.bib5); Kojima et al., [2022b](https://arxiv.org/html/2606.17524#bib.bib26); Wang et al., [2023a](https://arxiv.org/html/2606.17524#bib.bib27); Yao et al., [2024](https://arxiv.org/html/2606.17524#bib.bib20); Shinn et al., [2023](https://arxiv.org/html/2606.17524#bib.bib28)). These methods often improve task performance and are widely adopted in medical settings because they appear interpretable. However, they operate at the level of generated text and do not directly regulate the model's internal reasoning process. Prior work has shown that reasoning traces may contain logical gaps or hallucinated content even when final answers appear fluent or correct (Lyu et al., [2023](https://arxiv.org/html/2606.17524#bib.bib29); Lanham et al., [2023](https://arxiv.org/html/2606.17524#bib.bib30)). Moreover, generating long reasoning traces increases inference latency and computational cost, which can limit practicality in time-sensitive clinical scenarios.
Figure 1: Comparison of ReLAR and conventional autoregressive reasoning. ReLAR iteratively refines the hidden state before decoding, while the autoregressive baseline decodes from a fixed $h_0$ without correction.

This failure mode is illustrated in Figure [1](https://arxiv.org/html/2606.17524#S1.F1). In this case, the baseline model decodes from a fixed hidden state, leading it to prematurely commit to a reasoning outcome. A more reliable reasoning process requires integrating the problem quantities, arithmetic relations, and fractional constraint before committing to an output. The key difference is not the availability of mathematical knowledge, but whether the model can revise and stabilize its internal representation as evidence is accumulated.
Recent work has therefore explored latent representation editing and intervention as a mechanism for controlling model reasoning (Wang et al., [2025](https://arxiv.org/html/2606.17524#bib.bib53); Stolfo et al., [2025](https://arxiv.org/html/2606.17524#bib.bib54); Helff et al., [2026](https://arxiv.org/html/2606.17524#bib.bib55)). Hidden state representations encode structured and semantically meaningful information, and interventions on internal activations can influence model behavior more directly than output-level supervision (Meng et al., [2022](https://arxiv.org/html/2606.17524#bib.bib31); Li et al., [2023](https://arxiv.org/html/2606.17524#bib.bib40)). However, existing latent methods remain limited for complex multi-step reasoning. Most focus on static or single-step interventions and do not support iterative refinement or explicit control of internal consistency across reasoning steps. As a result, early errors in latent states may persist and accumulate, leading to unstable reasoning trajectories. Existing methods therefore either operate at the output level without controlling internal reasoning, or act on latent representations without iterative and adaptive refinement, leaving reasoning instability fundamentally unaddressed.
To address this limitation, we propose an iterative hidden\-state refinement framework that enables reinforcement\-learning\-controlled internal reasoning prior to decoding\. Our method performs a sequence of refinement steps entirely in hidden\-state space, allowing internal representations to be progressively adjusted and stabilized before any output is generated\. A learned controller dynamically determines both the refinement direction and the number of refinement iterations, enabling adaptive allocation of reasoning depth based on task difficulty\. By intervening directly in hidden representations before decoding, our framework provides explicit and fine\-grained control over internal reasoning trajectories that is not achievable through output\-level reasoning supervision alone\.
Our contributions can be summarized as follows\. \(1\) We propose an iterative hidden\-state refinement framework that enables direct control over internal reasoning trajectories prior to decoding\. \(2\) We introduce reinforcement\-learning\-based controllers that dynamically modulate refinement direction and reasoning depth, allowing adaptive allocation of internal reasoning\. \(3\) Experiments on clinical reasoning benchmarks demonstrate improved step\-level coherence and overall reliability, while achieving lower inference\-time overhead than explicit reasoning\-based baselines\.
## 2 Related Work
### 2.1 Implicit Reasoning in Large Language Models
Large language models (LLMs) can perform complex reasoning not only through explicit natural-language rationales, but also through implicit computation within their internal representations. While chain-of-thought prompting elicits intermediate reasoning steps in text (Wei et al., [2022](https://arxiv.org/html/2606.17524#bib.bib5); Kojima et al., [2022a](https://arxiv.org/html/2606.17524#bib.bib6)), recent studies suggest that models may encode task-relevant reasoning information in hidden states even when such reasoning is not explicitly verbalized (Schlag et al., [2021](https://arxiv.org/html/2606.17524#bib.bib46); Geva et al., [2021](https://arxiv.org/html/2606.17524#bib.bib48)).
Implicit reasoning is attractive because it avoids the cost and potential unfaithfulness of long textual rationales, while still allowing the model to integrate evidence before producing an answer\. However, standard LLM inference usually relies on a single forward pass, leaving the implicit reasoning process largely uncontrolled\. As a result, hidden representations may encode incomplete or unstable reasoning states before decoding\.
Our work builds on this view by treating reasoning as an internal latent process that can be refined before generation\. Instead of requiring the model to expose all intermediate steps in text, we iteratively update hidden representations, enabling implicit reasoning to be stabilized and controlled prior to decoding\.
### 2.2 Latent Reasoning and Representation-Level Refinement
Reasoning in language models is commonly elicited through chain-of-thought (CoT) prompting (Wei et al., [2022](https://arxiv.org/html/2606.17524#bib.bib5)) and its extensions, including self-consistency (Wang et al., [2023b](https://arxiv.org/html/2606.17524#bib.bib7)) and tree-structured exploration (Yao et al., [2024](https://arxiv.org/html/2606.17524#bib.bib20)). These methods operate at the level of generated text and require explicit production of intermediate reasoning traces, which can be unstable and computationally expensive, particularly in clinical settings.
Recent work explores latent reasoning, where multi-step inference occurs within hidden-state space rather than through generated tokens (Schlag et al., [2021](https://arxiv.org/html/2606.17524#bib.bib46); Creswell et al., [2023](https://arxiv.org/html/2606.17524#bib.bib49)). Prior approaches study hidden-state editing or activation refinement, but typically rely on predefined or heuristic interventions and lack principled control over internal reasoning dynamics (Elazar et al., [2021](https://arxiv.org/html/2606.17524#bib.bib47); Geva et al., [2021](https://arxiv.org/html/2606.17524#bib.bib48)), which is particularly problematic for reliable medical reasoning. In contrast, our method directly intervenes in hidden representations, enabling explicit and fine-grained control over reasoning dynamics, including refinement depth and direction.
### 2.3 Reinforcement Learning for Adaptive Reasoning Control
Reinforcement learning (RL) has been widely adopted for policy optimization, reward shaping, and adaptive computation in large-scale language systems (Ouyang et al., [2022](https://arxiv.org/html/2606.17524#bib.bib8); Bai et al., [2022](https://arxiv.org/html/2606.17524#bib.bib10); Rafailov et al., [2023](https://arxiv.org/html/2606.17524#bib.bib11)). Depth-adaptive mechanisms such as Adaptive Computation Time (ACT) (Graves, [2016](https://arxiv.org/html/2606.17524#bib.bib22)) and dynamic execution policies demonstrate the benefits of allocating variable computation based on input complexity. RL has also been applied to control representation-level interventions in model editing and modular architectures.
However, these approaches are not designed to stabilize multi-step reasoning in high-stakes domains such as medicine. In contrast, our work leverages reinforcement learning to directly control latent reasoning dynamics, training dedicated controllers that adaptively select both refinement depth and refinement direction (Turner et al., [2023](https://arxiv.org/html/2606.17524#bib.bib41); Meng et al., [2022](https://arxiv.org/html/2606.17524#bib.bib31)). This design enables input-dependent internal reasoning while maintaining stability and efficiency, which is critical for reliable medical decision support.
Figure 2: Overview of the model pipeline.

Figure 3: Iterative latent-state refinement.
## 3 Methodology
We introduce ReLAR (Reinforcement-Guided Latent Refinement), an iterative hidden-state refinement framework that enables controllable, multi-step reasoning entirely within the latent space of a pretrained language model. Rather than producing an answer from a single forward pass, ReLAR executes a sequence of representation-refinement steps before decoding, guided by two learned controllers that adaptively determine *how deeply* and *in which direction* the hidden state should be revised. Figure [2](https://arxiv.org/html/2606.17524#S2.F2) gives an overview of the full pipeline.
### 3.1 Preliminaries
Let $x=(x_1,\dots,x_n)$ be an input token sequence and let $p_\theta$ denote a transformer-based language model that maps $x$ to output distributions via hidden representations. A single forward pass through $p_\theta$ yields a final-layer hidden state $h_0\in\mathbb{R}^{L\times D}$, where $L$ is the sequence length and $D$ is the hidden dimension.
Our framework augments $p_\theta$ with a low-dimensional *reasoning state* $s_t\in\mathbb{R}^{d_s}$ that tracks the model's evolving internal latent representation of the input. Unlike chain-of-thought rationales, $s_t$ is never decoded into text; it serves exclusively as an internal control signal. Starting from $(h_0,s_0)$, the framework produces coupled refinement trajectories

$$s_0 \;\to\; s_1 \;\to\; \cdots \;\to\; s_T, \qquad h_0 \;\to\; h_1 \;\to\; \cdots \;\to\; h_T,$$

where the refinement depth $T$ is selected adaptively by a learned depth controller, as described in Section [3.4](https://arxiv.org/html/2606.17524#S3.SS4).
### 3.2 Initial Reasoning State
Given the base hidden representation $h_0$, we construct the initial reasoning state via a lightweight projection network $f_{\mathrm{extract}}$:

$$s_0 \;=\; f_{\mathrm{extract}}(h_0).$$

The vector $s_0\in\mathbb{R}^{d_s}$ compresses task-relevant information from the final transformer layer into a compact representation. It acts as a *shared bottleneck*: the depth controller $\pi_d$ takes $s_0$ as its sole input, and the action controller $\pi_a$ takes the reasoning state $s_t$, which is initialized at $s_0$, so all downstream refinement decisions are governed by this single summary.
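The paper does not spell out the architecture of $f_{\mathrm{extract}}$. As a minimal sketch, assuming mean pooling over the sequence axis followed by a single affine map and a `tanh` nonlinearity (all of which are illustrative choices, as are the names `mean_pool`, `weight`, and `bias`), the projection from $h_0\in\mathbb{R}^{L\times D}$ to $s_0\in\mathbb{R}^{d_s}$ could look like this:

```python
import math

def mean_pool(h):
    """Average an L x D hidden state over the sequence axis, giving a length-D vector."""
    seq_len = len(h)
    return [sum(row[j] for row in h) / seq_len for j in range(len(h[0]))]

def f_extract(h0, weight, bias):
    """Hypothetical extractor: mean-pool, then an affine map (d_s x D) followed by tanh."""
    pooled = mean_pool(h0)
    return [
        math.tanh(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weight, bias)
    ]

h0 = [[1.0, 2.0], [3.0, 4.0]]                         # L = 2 positions, D = 2 features
s0 = f_extract(h0, weight=[[0.5, 0.0]], bias=[0.0])   # d_s = 1
```

Pooling before projecting keeps the size of $s_0$ independent of the sequence length, which is one way to obtain the compact bottleneck described above.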
### 3.3 Action-Guided Representation Refinement
At each refinement step $t\in\{0,\ldots,T-1\}$, the action controller predicts a refinement direction and two modulation parameters:

$$a_t=(\gamma_t,\beta_t,\boldsymbol{v}_t)\sim\pi_a(a_t\mid s_t),$$

where $\boldsymbol{v}_t\in\mathbb{R}^{D}$ is normalized to satisfy $\|\boldsymbol{v}_t\|=1$. The scalar parameters $\gamma_t$ and $\beta_t$ are used to compute an effective signed step size

$$\alpha_t=f_\alpha(\gamma_t,\beta_t),$$

which determines the magnitude and sign of the update. The hidden representation is then refined by an additive perturbation:

$$h_{t+1}=h_t+\alpha_t\boldsymbol{v}_t.$$

Equivalently, this additive update defines the refinement function

$$f_{\mathrm{refine}}(h_t,s_t,\gamma_t,\beta_t,\boldsymbol{v}_t)=h_t+f_\alpha(\gamma_t,\beta_t)\,\boldsymbol{v}_t.$$

Thus, $\boldsymbol{v}_t$ determines the direction of refinement, while $\alpha_t$ determines how far the representation moves along that direction.
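The update above can be sketched in a few lines. Two points are assumptions rather than details from the paper: the form of $f_\alpha$ (here the sign comes from `tanh(gamma)` and the magnitude from a softplus of `beta`), and the broadcasting of the $D$-dimensional direction across all $L$ positions of $h_t$.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm so that it only encodes a direction."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return [x / norm for x in v]

def f_alpha(gamma, beta):
    """Hypothetical signed step size: sign from tanh(gamma), magnitude from softplus(beta)."""
    return math.tanh(gamma) * math.log1p(math.exp(beta))

def refine_step(h, gamma, beta, v):
    """One additive update h_{t+1} = h_t + alpha_t * v_t, applied to every position (assumed)."""
    alpha = f_alpha(gamma, beta)
    v_unit = normalize(v)
    return [[x + alpha * d for x, d in zip(row, v_unit)] for row in h]

h = [[0.0, 0.0], [1.0, 1.0]]
h_next = refine_step(h, gamma=50.0, beta=0.0, v=[3.0, 4.0])
```

With `gamma = 0` the step size is zero and the hidden state is left unchanged, which illustrates how a signed, bounded $\alpha_t$ lets the controller effectively skip an update.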
The reasoning state is then refreshed to reflect the updated hidden state:
$$s_{t+1}=g(s_t,h_{t+1}).$$

Iterating this procedure for $T$ steps allows the model to progressively stabilize its internal representation without producing any intermediate tokens.
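Putting the two updates together, the coupled trajectories $(h_t, s_t)$ can be sketched as a simple loop. The callables `act` and `update_state` stand in for the action controller and for $g$; their toy bodies below are placeholders chosen for illustration, not the learned modules.

```python
def refine(h0, s0, depth, act, update_state):
    """Run `depth` refinement steps; `act` and `update_state` are hypothetical callables."""
    h, s = h0, s0
    trajectory = [(h, s)]
    for _ in range(depth):
        alpha, v = act(s)                       # signed step size and unit direction
        h = [[x + alpha * d for x, d in zip(row, v)] for row in h]
        s = update_state(s, h)                  # s_{t+1} = g(s_t, h_{t+1})
        trajectory.append((h, s))
    return trajectory

# Toy stand-ins, for illustration only.
def toy_act(s):
    return 0.5, [1.0, 0.0]                      # always step +0.5 along the first feature

def toy_update(s, h):
    pooled = [sum(row[j] for row in h) / len(h) for j in range(len(h[0]))]
    return [0.5 * a + 0.5 * b for a, b in zip(s, pooled)]

traj = refine([[0.0, 0.0], [2.0, 2.0]], [0.0, 0.0], 3, toy_act, toy_update)
h_final, s_final = traj[-1]
```

No tokens are produced inside the loop; only the pair of states is carried forward.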
After all refinement steps, the final reasoning statesTs\_\{T\}is realized into a decoding representation that is anchored to the original encodingh0h\_\{0\}:
$$h_T \;=\; f_{\mathrm{decode}}(s_T,\;h_0), \qquad \hat{y} \;\sim\; p_\theta(y\mid x,\;h_T).$$

Anchoring to $h_0$ preserves the full context of the input while allowing the refinement trajectory to adjust task-relevant features.
### 3.4 Reinforcement-Learning Controller Optimization
#### Adaptive depth.
Applying a fixed number of refinement steps uniformly ignores the varying complexity of different inputs. We therefore introduce a depth controller $\pi_d$ that selects the number of refinement steps from the initial reasoning state:

$$T \;\sim\; \pi_d(T\mid s_0).$$

This allows the model to invest more computation on difficult examples and skip unnecessary steps for simpler ones.
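A common way to realize such a controller is a categorical distribution over a finite set of depths. The sketch below assumes the controller emits one logit per depth in $\{1,\dots,K\}$ (the depth range and the function names are assumptions) and returns both the sampled depth and its log-probability, which a policy-gradient update would need.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_depth(logits, rng):
    """Sample T in {1, ..., len(logits)} and return (T, log pi_d(T | s0))."""
    probs = softmax(logits)
    u = rng.random()
    cumulative = 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k + 1, math.log(p)
    return len(probs), math.log(probs[-1])      # guard against rounding in the cumulative sum

rng = random.Random(0)
depth, logp = sample_depth([0.0, 1.0, 2.0], rng)
```

At inference time one could instead take the most probable depth; the paper does not state which choice it uses.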
#### Step\-wise reward\.
To provide dense credit assignment across refinement steps, we define a per-step improvement in terms of the ground-truth log-likelihood. For any reasoning state $s_t$, we evaluate its quality by realizing it and scoring the target output $y^{*}$:

$$\log p_\theta(y^{*}\mid x,s_t) \;\triangleq\; \log p_\theta\!\left(y^{*}\mid x,\;f_{\mathrm{decode}}(s_t,h_0)\right).$$

The per-step improvement and reward are then

$$\begin{aligned}
\Delta_t &= \log p_\theta(y^{*}\mid x,s_{t+1})-\log p_\theta(y^{*}\mid x,s_t),\\
r_t &= \Delta_t-c_d,
\end{aligned}$$

where $c_d>0$ is a fixed per-step computation cost that discourages over-refinement.
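As a small worked example of the reward definition, the function below takes the sequence of target log-likelihoods $\log p_\theta(y^{*}\mid x,s_t)$ for $t=0,\dots,T$ and returns $r_0,\dots,r_{T-1}$. The numeric values are made up for illustration.

```python
def step_rewards(log_likelihoods, step_cost):
    """Return r_t = (ll[t+1] - ll[t]) - c_d for t = 0..T-1, given ll[0..T]."""
    return [
        log_likelihoods[t + 1] - log_likelihoods[t] - step_cost
        for t in range(len(log_likelihoods) - 1)
    ]

lls = [-2.0, -1.5, -1.4]            # illustrative log p(y* | x, s_t) for t = 0, 1, 2
rewards = step_rewards(lls, step_cost=0.05)
```

Because the improvements telescope, the rewards sum to $\log p_\theta(y^{*}\mid x,s_T)-\log p_\theta(y^{*}\mid x,s_0)-T\,c_d$: an extra step only pays off if it raises the target log-likelihood by more than $c_d$.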
#### Shaped return and RL objective\.
For $t=0,1,\dots,T-1$ we define the shaped step return

$$R_t \;=\; \frac{r_t}{t+1} \;+\; \frac{1}{T}\!\left(-\beta\,\mathrm{KL} \;+\; \lambda H\right),$$

where the $1/(t+1)$ discount reduces credit for later steps, and the second term applies a trajectory-level KL penalty (to prevent policy collapse) and entropy bonus $H$ (to encourage exploration). The trajectory-level return is $r=\sum_{t=0}^{T-1}R_t$.
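The shaped return can be computed directly from the step rewards. In the sketch below, `kl` and `entropy` are assumed to be scalar trajectory-level quantities supplied by the caller, and the coefficient values are illustrative.

```python
def shaped_returns(rewards, kl, entropy, kl_coef, ent_coef):
    """Return R_t = r_t / (t + 1) + (1 / T) * (-kl_coef * kl + ent_coef * entropy)."""
    horizon = len(rewards)
    regularizer = (-kl_coef * kl + ent_coef * entropy) / horizon
    return [r / (t + 1) + regularizer for t, r in enumerate(rewards)]

step_returns = shaped_returns([0.45, 0.05], kl=0.2, entropy=1.0, kl_coef=0.1, ent_coef=0.01)
trajectory_return = sum(step_returns)
```

Dividing the regularizer by $T$ means it contributes exactly $-\beta\,\mathrm{KL}+\lambda H$ to the trajectory-level return, regardless of the sampled depth.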
Assuming conditional independence, the joint log\-policy factorizes as
$$\log\pi(T,a_{0:T-1}\mid s_0) \;=\; \log\pi_d(T\mid s_0) \;+\; \sum_{t=0}^{T-1}\log\pi_a(a_t\mid s_t).$$

The policy-gradient RL loss is

$$\mathcal{L}_{\mathrm{RL}} = -\,\mathbb{E}\!\left[A\cdot\left(\log\pi_d(T\mid s_0) \;+\; \sum_{t=0}^{T-1}\log\pi_a(a_t\mid s_t)\right)\right]-\lambda H,$$

where $A=r-\mathbb{E}[r]$ is the advantage estimate computed with a running baseline.
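A Monte Carlo estimate of this loss over a batch of sampled trajectories might look as follows. The exponential-moving-average baseline is one common choice and is an assumption here, as are the dictionary keys; in an autodiff framework the advantage would be treated as a constant so that gradients flow only through the log-probabilities and the entropy.

```python
def update_baseline(baseline, trajectory_return, momentum=0.9):
    """Exponential moving average of returns, one common choice of running baseline."""
    return momentum * baseline + (1.0 - momentum) * trajectory_return

def rl_loss(batch, baseline, ent_coef):
    """Batch estimate of L_RL.

    Each item holds the trajectory return `ret`, the depth log-probability
    `logp_depth`, the per-step action log-probabilities `logp_actions`,
    and the policy entropy `entropy`.
    """
    pg_terms, entropies = [], []
    for item in batch:
        advantage = item["ret"] - baseline       # constant with respect to gradients
        logp = item["logp_depth"] + sum(item["logp_actions"])
        pg_terms.append(advantage * logp)
        entropies.append(item["entropy"])
    n = len(batch)
    return -sum(pg_terms) / n - ent_coef * sum(entropies) / n

batch = [{"ret": 1.0, "logp_depth": -1.0, "logp_actions": [-0.25, -0.25], "entropy": 2.0}]
loss = rl_loss(batch, baseline=0.5, ent_coef=0.01)
new_baseline = update_baseline(0.5, 1.0)
```

Subtracting the baseline does not change the expected gradient but typically lowers its variance.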
### 3.5 Joint Training Objective
After latent refinement, the final decoding representation $h_T$ is used to optimize a standard teacher-forced language-model loss over the target sequence $y^{*}=(y^{*}_1,\dots,y^{*}_M)$:

$$\mathcal{L}_{\mathrm{LM}} \;=\; -\frac{1}{M}\sum_{t=1}^{M}\log p_\theta\!\left(y^{*}_t\mid y^{*}_{<t},\;x,\;h_T\right).$$

This objective drives the refined representation to be both task-relevant and generation-compatible.
The full training objective jointly optimizes the base language model and both controllers:
$$\boxed{\mathcal{L}_{\mathrm{total}} \;=\; \mathcal{L}_{\mathrm{LM}} \;+\; \alpha_{\mathrm{RL}}\,\mathcal{L}_{\mathrm{RL}},}$$

where $\alpha_{\mathrm{RL}}$ balances language modeling fidelity against the quality of learned refinement policies. Through this joint objective, the model learns to perform stable, controllable reasoning entirely in latent space before committing to any output token.
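For concreteness, the two loss terms combine as below. The per-token log-probabilities, the RL loss value, and the weight `alpha_rl` are illustrative numbers, not values reported in the paper.

```python
def lm_loss(token_logps):
    """Mean negative log-likelihood over the M target tokens (teacher forcing)."""
    return -sum(token_logps) / len(token_logps)

def total_loss(token_logps, rl_loss_value, alpha_rl):
    """L_total = L_LM + alpha_RL * L_RL."""
    return lm_loss(token_logps) + alpha_rl * rl_loss_value

loss_lm = lm_loss([-0.5, -1.5])                       # M = 2 target tokens
loss_total = total_loss([-0.5, -1.5], rl_loss_value=0.73, alpha_rl=0.1)
```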
Table 1: Performance comparison on four reasoning datasets under different shot settings.

Table 2: Open-ended generation performance across general-domain generation tasks.

Table 3: Ablation study on refinement strategies across four reasoning datasets.

Table 4: Accuracy–latency comparison of inference-time reasoning strategies on PubMedQA using the Gemma-2B backbone. SC-CoT denotes self-consistency chain-of-thought with $n=5$ sampled reasoning paths. Time / Ours denotes the average inference time of each method divided by that of our method.

Table 5: Comparison between SFT and latent refinement on PubMedQA.
## 4 Experiments
We evaluate our hidden\-state refinement framework across medical, mathematical, multi\-hop, and open\-ended generation tasks\. Our evaluation focuses on the following questions: \(1\) Can latent refinement improve end\-task performance across diverse reasoning settings? \(2\) Does reinforcement learning over refinement depth and direction provide additional gains over standard supervised fine\-tuning? \(3\) How do different refinement design choices affect stability, accuracy, and generation quality?
### 4.1 Tasks and Datasets
#### PubMedQA.
PubMedQA (Jin et al., [2019](https://arxiv.org/html/2606.17524#bib.bib36)) is a biomedical QA benchmark requiring yes/no/maybe answers over PubMed abstracts. It tests causal and quantitative reasoning over scientific literature, serving as a proxy for evidence-based clinical reasoning. We report accuracy and macro-F1.
#### GSM8K and GSM-Hard.
GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.17524#bib.bib56)) is a grade-school math benchmark requiring multi-step arithmetic reasoning. GSM-Hard extends GSM8K with more challenging numerical problems. We include both datasets to evaluate generalization beyond the medical domain, reporting accuracy and pass@5.
#### HotpotQA.
HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.17524#bib.bib52)) is a multi-hop QA benchmark requiring reasoning across multiple documents to reach a final answer. We include it to assess general multi-hop reasoning capability, reporting accuracy and F1.
#### Open-ended generation tasks.
To evaluate whether latent refinement improves free-form generation quality beyond short-answer prediction, we further consider two general-domain open-ended generation tasks: CommonGen (Lin et al., [2020](https://arxiv.org/html/2606.17524#bib.bib57)) and WritingPrompts (Fan et al., [2018](https://arxiv.org/html/2606.17524#bib.bib58)). CommonGen requires models to generate a coherent sentence that covers a set of given concepts, while WritingPrompts evaluates longer-form story generation from natural-language prompts. We report BERTScore and ROUGE-L.
### 4.2 Models and Baselines
#### Backbone models.
Our method is implemented as a lightweight refinement module on top of three open-source backbones of different sizes and families: LLaMA-1.1B, Gemma-2B, and Qwen-3B. Unless otherwise specified, all ablation studies in Table [3](https://arxiv.org/html/2606.17524#S3.T3) use the Gemma-2B backbone for consistency.
#### General-purpose LLM baselines.
Table [1](https://arxiv.org/html/2606.17524#S3.T1) compares our method with a range of general-purpose LLM baselines, including LLaMA-2-7B (Touvron et al., [2023](https://arxiv.org/html/2606.17524#bib.bib61)), Mistral-7B-v0.3 (Jiang et al., [2023](https://arxiv.org/html/2606.17524#bib.bib62)), Falcon-7B (Almazrouei et al., [2023](https://arxiv.org/html/2606.17524#bib.bib63)), Gemma-7B (Gemma Team, [2024](https://arxiv.org/html/2606.17524#bib.bib64)), Mistral-7B-Instruct, Llama-3-8B-Instruct, and Qwen2.5-7B (Qwen Team, [2025](https://arxiv.org/html/2606.17524#bib.bib65)). These models provide a broad comparison across general reasoning and generation settings.
#### Medical LLM baselines.
For biomedical and clinical reasoning evaluation, we additionally compare against strong medical LLM baselines, including Qwen2.5-Med-7B, Med42-Mistral and Med42-Llama3 (Christophe et al., [2024](https://arxiv.org/html/2606.17524#bib.bib43)), and MedGemma-4B (Sellergren et al., [2025](https://arxiv.org/html/2606.17524#bib.bib44)). These baselines represent open medical LLMs adapted from Qwen, Mistral, LLaMA, and Gemma architectures.
### 4.3 Evaluation Protocol
For PubMedQA, we follow the standard three-way classification setting and report accuracy and macro-F1. For GSM8K and GSM-Hard, we report accuracy and pass@5 to evaluate both direct correctness and sampled solution quality. For HotpotQA, we report accuracy and F1 to measure multi-hop answer correctness. For open-ended generation, we compute BERTScore (Zhang et al., [2020](https://arxiv.org/html/2606.17524#bib.bib59)) and ROUGE-L (Lin, [2004](https://arxiv.org/html/2606.17524#bib.bib60)) against reference outputs to assess semantic similarity and sequence-level overlap, respectively. All reported numbers are averaged over three random seeds; we use the same decoding temperature and maximum generation length for all models to ensure comparability.
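To make the PubMedQA metric concrete, macro-F1 is the unweighted mean of the per-class F1 scores over the three labels. The sketch below is a plain-Python illustration; scoring a class with no true or predicted instances as 0.0 is a convention assumed here, and the paper does not state which implementation it uses.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0.0."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

gold = ["yes", "yes", "no", "maybe"]
pred = ["yes", "no", "no", "maybe"]
score = macro_f1(gold, pred, labels=["yes", "no", "maybe"])
```

Because every class counts equally, macro-F1 penalizes a model that neglects the rarer "maybe" label even when overall accuracy stays high.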
### 4.4 Main Results
Table [1](https://arxiv.org/html/2606.17524#S3.T1) summarizes the overall performance across four reasoning datasets under both 0-shot and 5-shot settings. On PubMedQA, our method achieves the best performance among all compared models, reaching 77.67% accuracy and 72.54 macro-F1 in the 0-shot setting, and further improving to 79.23% accuracy and 74.17 macro-F1 with 5-shot prompting. These results indicate that latent refinement is particularly effective for biomedical reasoning, where stable integration of evidence is important.
Beyond PubMedQA, ReLAR also shows strong performance on mathematical and multi-hop reasoning tasks. On GSM-Hard, our method achieves the best accuracy and pass@5 under both 0-shot and 5-shot settings, suggesting that refinement in hidden-state space improves robustness on more challenging arithmetic problems. On HotpotQA, ReLAR also obtains the best accuracy and F1 across both settings, demonstrating that the proposed refinement mechanism generalizes to multi-hop question answering. On GSM8K, while Llama-3-8B-Instruct obtains the highest accuracy, our method remains competitive and achieves strong pass@5 performance. Overall, these results show that latent refinement improves reasoning reliability across multiple task types rather than being limited to a single domain.
Table [2](https://arxiv.org/html/2606.17524#S3.T2) further evaluates open-ended generation quality on CommonGen and WritingPrompts. Our method achieves the highest BERTScore and ROUGE-L across both tasks, indicating that iterative latent refinement improves not only short-answer accuracy but also free-form generation quality. These results suggest that refining hidden representations before decoding can produce outputs that are more semantically aligned with reference generations.
### 4.5 Ablation Studies
To understand which components are most important, we perform ablations summarized in Table [3](https://arxiv.org/html/2606.17524#S3.T3). Removing refinement entirely ("No Refinement") causes a large drop across reasoning datasets, confirming that the base LM alone cannot fully exploit the supervision signal. Static refinement with a fixed depth recovers part of the performance gain, but remains less effective than adaptive refinement. Using adaptive depth without direction or adaptive direction without depth also improves over the no-refinement baseline, showing that both components contribute to the final result. The full model, which combines adaptive depth and adaptive direction, achieves the strongest overall performance.
Table [5](https://arxiv.org/html/2606.17524#S3.T5) further compares SFT and our method across three backbones on PubMedQA. In all cases, adding our refinement module yields consistent improvements in both accuracy and macro-F1, suggesting that reinforcement-guided latent refinement provides complementary benefits on top of standard supervised tuning.
We also analyze the behavior of the learned depth controller in Figure [4](https://arxiv.org/html/2606.17524#S4.F4). The controller assigns different refinement depths to different inputs, indicating that ReLAR does not apply a fixed amount of computation uniformly. Harder examples tend to receive more refinement steps, and the highest accuracy is observed at depth 3. This supports the motivation for adaptive depth: different inputs require different amounts of internal reasoning before decoding.
In addition to accuracy, Figure [5](https://arxiv.org/html/2606.17524#S4.F5) compares the inference-time cost of different reasoning strategies. Although chain-of-thought and self-consistency chain-of-thought introduce substantial inference overhead, ReLAR remains close to standard SFT inference while avoiding the high cost of explicit reasoning traces. This suggests that latent refinement provides a more efficient alternative to explicit chain-of-thought generation.
Finally, Figure [6](https://arxiv.org/html/2606.17524#S4.F6) visualizes the effect of iterative latent refinement across refinement steps. ReLAR progressively improves the refinement trajectory, while the fixed-hidden-state baseline remains nearly unchanged. This provides additional evidence that the proposed refinement process stabilizes the internal representation before generation.
Figure 4: Depth distribution of the learned depth controller evaluated on GSM-Hard (projected $n=200$, based on 20 actual samples). The controller allocates more refinement steps to harder inputs, with depth 3 achieving the highest accuracy (75%) compared to depth 2 (12%).

Figure 5: Inference time relative to ReLAR. CoT and SC-CoT incur 65$\times$ and 117$\times$ overhead, respectively.

Figure 6: Latent refinement across refinement steps. ReLAR progressively improves while the baseline remains flat.
## 5 Conclusion
We presented ReLAR, a reinforcement\-guided latent refinement framework that enables controllable multi\-step reasoning entirely within the hidden\-state space of a pretrained language model\. Rather than relying on explicit chain\-of\-thought generation, ReLAR iteratively refines internal representations prior to decoding, guided by learned depth and action controllers trained with a policy\-gradient objective\. Experiments across medical, mathematical, multi\-hop, and open\-ended generation benchmarks demonstrate that ReLAR consistently improves accuracy and generation quality over strong general\-purpose and medical LLM baselines, while achieving substantially lower inference overhead than explicit reasoning approaches\. Ablation studies further confirm that both adaptive depth and adaptive direction contribute to the overall performance, and that reinforcement\-guided refinement provides complementary benefits beyond standard supervised fine\-tuning\.
## Limitations
Despite promising results, ReLAR has several limitations\. First, our backbone models \(LLaMA\-1\.1B, Gemma\-2B, Qwen\-3B\) are smaller than the 7B baselines in our comparison, which may limit the direct comparability of results\. Second, the reinforcement learning training requires per\-step likelihood evaluation against ground\-truth labels, making the framework dependent on supervised signal and potentially less applicable to purely unsupervised settings\. Third, while latent refinement reduces inference overhead compared to chain\-of\-thought approaches, the iterative hidden\-state updates still introduce additional parameters and training complexity relative to standard fine\-tuning\. Finally, the internal refinement process operates entirely in latent space and is not directly interpretable, which may limit applicability in settings where reasoning transparency is required, such as high\-stakes clinical decision support\.
## Appendix A Additional Theoretical Analysis
### A\.1 Adaptive Depth Dominates Fixed Depth
We provide a simple justification for using an adaptive depth controller\. Let $R(x,T)$ denote the expected refinement return for input $x$ when using $T$ refinement steps\. A fixed\-depth policy selects the same depth $T_0$ for all inputs:

$$J_{\mathrm{fixed}}(T_0)=\mathbb{E}_{x\sim\mathcal{D}}[R(x,T_0)].$$

In contrast, an adaptive\-depth policy can choose an input\-dependent depth $T(x)\in\mathcal{T}$:

$$J_{\mathrm{adapt}}=\mathbb{E}_{x\sim\mathcal{D}}\left[\max_{T\in\mathcal{T}}R(x,T)\right].$$

For any fixed $T_0\in\mathcal{T}$, we have

$$\max_{T\in\mathcal{T}}R(x,T)\geq R(x,T_0),$$

and therefore

$$J_{\mathrm{adapt}}\geq J_{\mathrm{fixed}}(T_0).$$

Since this holds for every fixed depth $T_0$, it follows that

$$J_{\mathrm{adapt}}\geq\max_{T_0\in\mathcal{T}}J_{\mathrm{fixed}}(T_0).$$

Thus, fixed\-depth refinement is a special case of adaptive\-depth refinement\. This does not guarantee that training will always find the optimal adaptive policy, but it shows that adaptive depth provides a strictly richer policy class and motivates using input\-dependent refinement depth\.
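The bound can be checked numerically on a toy return table\. The sketch below is illustrative only: the table of returns is randomly generated \(it is not data from the paper\), and the helper names are hypothetical\. It compares the empirical value of every fixed depth against an oracle that picks the best depth per input, which is the quantity the derivation calls the adaptive objective\.

```python
import random


def make_toy_returns(num_inputs, num_depths, seed=0):
    """Hypothetical return table: returns[i][t] stands in for R(x_i, T=t)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(num_depths)] for _ in range(num_inputs)]


def fixed_depth_value(returns, depth):
    """Empirical J_fixed(T0): every input uses the same depth."""
    return sum(row[depth] for row in returns) / len(returns)


def adaptive_depth_value(returns):
    """Empirical J_adapt: each input uses its own best depth (oracle choice)."""
    return sum(max(row) for row in returns) / len(returns)


returns = make_toy_returns(num_inputs=1000, num_depths=4)
j_fixed = [fixed_depth_value(returns, t) for t in range(4)]
j_adapt = adaptive_depth_value(returns)

print("J_fixed per depth:", [round(v, 3) for v in j_fixed])
print("J_adapt (oracle):", round(j_adapt, 3))
print("bound holds:", j_adapt >= max(j_fixed))

# Equality case: depth 1 is best for every input, so adaptivity adds nothing.
dominated = [[0.1, 0.9], [0.2, 0.8]]
print("equality case:",
      adaptive_depth_value(dominated) == fixed_depth_value(dominated, 1))
```

The second table shows when the inequality is tight: if a single depth is optimal for every input, the adaptive and best fixed policies coincide, so any gain from adaptivity comes from inputs that disagree on their best depth\.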
## References
- E\. Almazrouei, H\. Alobeidli, A\. Alshamsi, A\. Cappelli, R\. Cojocaru, M\. Debbah, E\. Goffinet, D\. Hesslow, J\. Launay, Q\. Malartic, et al\. \(2023\) The falcon series of open language models\. arXiv preprint arXiv:2311\.16867\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px2.p1.1)\.
- Y\. Bai, A\. Jones, K\. Ndousse, et al\. \(2022\) Training a helpful and harmless assistant with reinforcement learning from human feedback\. In NeurIPS\. Cited by: [§2\.3](https://arxiv.org/html/2606.17524#S2.SS3.p1.1)\.
- X\. Chen et al\. \(2025\) Evaluating large language models and agents in healthcare\. npj Digital Medicine\. Note: In press\. External Links: [Document](https://dx.doi.org/10.1016/j.xdss.2025.100123)\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p2.1)\.
- C\. Christophe et al\. \(2024\) Med42: evaluating fine\-tuning strategies for medical large language models\. arXiv preprint arXiv:2404\.14779\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px3.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§4\.1](https://arxiv.org/html/2606.17524#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Creswell, M\. Shanahan, et al\. \(2023\) Reasoning with latent thoughts\. arXiv preprint arXiv:2305\.17007\. Cited by: [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p2.1)\.
- Y\. Elazar et al\. \(2021\) Amnesic probing: behavioral explanation with amnesic counterfactuals\. In ACL\. Cited by: [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p2.1)\.
- A\. Fan, M\. Lewis, and Y\. Dauphin \(2018\) Hierarchical neural story generation\. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp\. 889–898\. Cited by: [§4\.1](https://arxiv.org/html/2606.17524#S4.SS1.SSS0.Px4.p1.1)\.
- Gemma Team \(2024\) Gemma: open models based on gemini research and technology\. arXiv preprint arXiv:2403\.08295\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px2.p1.1)\.
- M\. Geva et al\. \(2021\) Transformer feed\-forward layers are key\-value memories\. In EMNLP\. Cited by: [§2\.1](https://arxiv.org/html/2606.17524#S2.SS1.p1.1), [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p2.1)\.
- A\. Graves \(2016\) Adaptive computation time for recurrent neural networks\. In ICML\. Cited by: [§2\.3](https://arxiv.org/html/2606.17524#S2.SS3.p1.1)\.
- K\. He et al\. \(2025\) A survey of large language models for healthcare\. Information Fusion\. Note: Forthcoming\. External Links: [Document](https://dx.doi.org/10.1016/j.inffus.2025.102963)\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p2.1)\.
- L\. Helff, R\. Härle, W\. Stammer, F\. Friedrich, M\. Brack, A\. Wüst, H\. Shindo, P\. Schramowski, and K\. Kersting \(2026\) ActivationReasoning: logical reasoning in latent activation spaces\. In The Fourteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=gGJh5AZTG7)\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p6.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. d\. L\. Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, et al\. \(2023\) Mistral 7b\. arXiv preprint arXiv:2310\.06825\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px2.p1.1)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. W\. Cohen, and X\. Lu \(2019\) PubMedQA: a dataset for biomedical research question answering\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)\. arXiv:1909\.06146\. Cited by: [§4\.1](https://arxiv.org/html/2606.17524#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Kojima, S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022a\) Large language models are zero\-shot reasoners\. arXiv preprint arXiv:2205\.11916\. Cited by: [§2\.1](https://arxiv.org/html/2606.17524#S2.SS1.p1.1)\.
- T\. Kojima, S\. S\. Gu, et al\. \(2022b\) Large language models are zero\-shot reasoners\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1)\.
- T\. Lanham, A\. Askell, et al\. \(2023\) Measuring faithfulness in chain\-of\-thought reasoning\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1)\.
- Y\. Li et al\. \(2023\) Editing factual knowledge in language models via representation surgery\. arXiv preprint arXiv:2305\.13144\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p6.1)\.
- B\. Y\. Lin, W\. Zhou, M\. Shen, P\. Zhou, C\. Bhagavatula, Y\. Choi, and X\. Ren \(2020\) CommonGen: a constrained text generation challenge for generative commonsense reasoning\. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp\. 1823–1840\. Cited by: [§4\.1](https://arxiv.org/html/2606.17524#S4.SS1.SSS0.Px4.p1.1)\.
- C\. Lin \(2004\) ROUGE: a package for automatic evaluation of summaries\. In Text Summarization Branches Out, pp\. 74–81\. Cited by: [§4\.3](https://arxiv.org/html/2606.17524#S4.SS3.p1.1)\.
- M\. M\. Lucas et al\. \(2024\) Reasoning with large language models for medical question answering\. Journal of the American Medical Informatics Association 31\(9\), pp\. 1964–1976\. External Links: [Document](https://dx.doi.org/10.1093/jamia/ocae102)\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p1.1)\.
- Q\. Lyu, A\. Stein, et al\. \(2023\) Faithful chain\-of\-thought reasoning\. In IJCNLP\-AACL\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1)\.
- K\. Meng, D\. Bau, and Y\. Belinkov \(2022\) Locating and editing factual associations in gpt\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p6.1), [§2\.3](https://arxiv.org/html/2606.17524#S2.SS3.p2.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, et al\. \(2022\) Training language models to follow instructions with human feedback\. In NeurIPS\. Cited by: [§2\.3](https://arxiv.org/html/2606.17524#S2.SS3.p1.1)\.
- Qwen Team \(2025\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px2.p1.1)\.
- R\. Rafailov, A\. Sharma, M\. Chang, et al\. \(2023\) Direct preference optimization: your language model is secretly a reward model\. In NeurIPS\. Cited by: [§2\.3](https://arxiv.org/html/2606.17524#S2.SS3.p1.1)\.
- I\. Schlag, K\. Irie, and J\. Schmidhuber \(2021\) Linear transformers are secretly fast weight programmers\. In ICML\. Cited by: [§2\.1](https://arxiv.org/html/2606.17524#S2.SS1.p1.1), [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p2.1)\.
- A\. Sellergren et al\. \(2025\) MedGemma: medical vision\-language foundation models based on gemma 3\. arXiv preprint arXiv:2507\.05201\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px3.p1.1)\.
- N\. Shinn, F\. Cassano, et al\. \(2023\) Reflexion: language agents with verbal reinforcement learning\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1)\.
- K\. Singhal, S\. Azizi, T\. Tu, S\. Mahdavi, M\. Noorbakhsh, A\. Rasouly, V\. Gupta, M\. Ghassemi, V\. Natarajan, et al\. \(2023\) Large language models encode clinical knowledge\. Nature\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p1.1)\.
- A\. Stolfo, V\. Balachandran, S\. Yousefi, E\. Horvitz, and B\. Nushi \(2025\) Improving instruction\-following in language models through activation steering\. In International Conference on Learning Representations, Vol\. 2025, pp\. 55790–55823\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p6.1)\.
- A\. J\. Thirunavukarasu, H\. Nori, T\. J\. Hwang, et al\. \(2023\) Large language models in medicine\. Nature Medicine 29, pp\. 1939–1951\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, et al\. \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\. Cited by: [§4\.2](https://arxiv.org/html/2606.17524#S4.SS2.SSS0.Px2.p1.1)\.
- A\. Turner et al\. \(2023\) Activation steering in large language models\. arXiv preprint arXiv:2301\.09521\. Cited by: [§2\.3](https://arxiv.org/html/2606.17524#S2.SS3.p2.1)\.
- W\. Wang, J\. Yang, and W\. Peng \(2025\) Semantics\-adaptive activation intervention for llms via dynamic steering vectors\. In International Conference on Learning Representations, Vol\. 2025, pp\. 79334–79351\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p6.1)\.
- X\. Wang, J\. Wei, et al\. \(2023a\) Self\-consistency improves chain\-of\-thought reasoning in language models\. Transactions of the ACL 11, pp\. 177–193\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, et al\. \(2023b\) Self\-consistency improves chain\-of\-thought reasoning\. In ICLR\. Cited by: [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.17524#S2.SS1.p1.1), [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp\. 2369–2380\. Cited by: [§4\.1](https://arxiv.org/html/2606.17524#S4.SS1.SSS0.Px3.p1.1)\.
- S\. Yao et al\. \(2024\) Tree of thoughts: deliberate problem solving with llms\. In ICLR\. Cited by: [§1](https://arxiv.org/html/2606.17524#S1.p4.1), [§2\.2](https://arxiv.org/html/2606.17524#S2.SS2.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\) BERTScore: evaluating text generation with bert\. In International Conference on Learning Representations\. Cited by: [§4\.3](https://arxiv.org/html/2606.17524#S4.SS3.p1.1)\.