divergence-regularization

Rethinking the Divergence Regularization in LLM RL

Hugging Face Daily Papers · 2026-06-08

This paper introduces DRPO, which replaces the hard mask in DPPO with a smooth, advantage-weighted quadratic regularizer. The regularizer provides continuous gradient corrections beyond trust-region boundaries, which is intended to improve stability and efficiency in LLM reinforcement learning.
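
To make the contrast in the summary concrete, here is a toy sketch in plain Python. It is not the paper's formulation: the per-token surrogate `ratio * advantage`, the trust-region half-width `eps`, the penalty coefficient `1 / eps`, and both function names are assumptions chosen only for illustration. The first function assumes a hard mask that zeroes the gradient once the probability ratio leaves the trust region; the second assumes a quadratic penalty on the ratio's deviation from 1, scaled by the absolute advantage.

```python
def hard_mask_grad(ratio, advantage, eps=0.2):
    """Gradient of ratio * advantage w.r.t. ratio, zeroed outside the trust region.

    Assumption: the hard mask simply drops tokens with |ratio - 1| > eps.
    """
    inside = abs(ratio - 1.0) <= eps
    return advantage if inside else 0.0


def quadratic_reg_grad(ratio, advantage, eps=0.2):
    """Gradient w.r.t. ratio of the hypothetical smooth objective

        ratio * A - |A| / (2 * eps) * (ratio - 1) ** 2

    The penalty is weighted by |A| (advantage-weighted) and is quadratic in
    the ratio's deviation from 1, so the gradient varies continuously.
    """
    return advantage - abs(advantage) * (ratio - 1.0) / eps


if __name__ == "__main__":
    # Compare both gradients for a positive advantage as the ratio grows.
    for r in (1.0, 1.1, 1.2, 1.3, 1.5):
        print(r, hard_mask_grad(r, 1.0), round(quadratic_reg_grad(r, 1.0), 3))
```

Under these assumptions, the masked gradient drops abruptly to zero outside the region, so a token that has drifted too far receives no signal at all. The quadratic variant shrinks smoothly toward zero near the boundary and then changes sign, pulling the ratio back toward 1.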
