DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Summary
DiPO introduces a novel reinforcement learning approach for LLMs that uses perplexity-based sample partitioning to disentangle exploration and exploitation subspaces, combined with a bidirectional reward allocation mechanism for more stable policy optimization. The method demonstrates superior performance on mathematical reasoning and function calling tasks.
View Cached Full Text
Cached at: 04/21/26, 07:21 AM
Paper page - DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Source: https://huggingface.co/papers/2604.13902 Authors:
,
,
,
,
,
,
,
,
,
,
Abstract
A novel reinforcement learning approach for large language models that addresses the exploration-exploitation trade-off through perplexity-based sample partitioning and bidirectional reward allocation mechanisms.
Reinforcement Learningwith Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities ofLarge Language Models(LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce aperplexity spacedisentangling strategythat divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiringexploration-exploitation trade-off. Subsequently, we propose abidirectional reward allocationmechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stablepolicy optimization. Finally, we have evaluated our method on two mainstream tasks:mathematical reasoningandfunction calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grainedexploration-exploitation trade-off.
View arXiv pageView PDFAdd to collection
Get this paper in your agent:
hf papers read 2604\.13902
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2604.13902 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2604.13902 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2604.13902 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Introduces LambdaPO, a novel reinforcement learning framework that improves upon GRPO by decomposing advantage estimation into pairwise preference comparisons and adding a semantic density reward, achieving better performance on math reasoning tasks.
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
Introduces ODRPO, a framework that decomposes discrete rewards into ordinal binary indicators to improve robustness of policy optimization in RLAIF for LLMs, achieving up to 14.8% relative improvement with minimal overhead.
Diffusion Policy Optimization without Drifting Apart
DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with policy-gradient updates to maintain a tight ELBO, preventing the double-drift phenomenon and achieving higher rewards in both language and continuous control tasks.
GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
GD^2PO introduces a conflict-aware filtering mechanism to mitigate multi-reward conflicts in reinforcement learning for large language models, preventing signal cancellation and accelerating training efficiency.
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
This paper introduces xi-DPO, a novel preference optimization method that reformulates the objective to minimize distance to optimal ratio reward margins, addressing hyperparameter tuning challenges in SimPO. Experimental results show that xi-DPO outperforms existing methods on open benchmarks.