Researchers present a practical variance reduction framework combining post-stratification with CUPED for heavy-tailed monetization metrics in ranking experiments, deployed at ShareChat to achieve equivalent statistical confidence with 45% less traffic. The paper is accepted at SIGIR 2026.
GRZO is a zeroth-order optimization method for fine-tuning large language models that reduces variance through group-relative normalization, achieving higher accuracy and better memory efficiency than MeZO.
Proposes a variance-reduced zeroth-order Langevin sampling method for non-log-concave distributions, establishing the first non-asymptotic convergence guarantees, and applies it to inverse problems with score-based generative priors.
This paper provides a refined theoretical analysis of actor-critic methods with entropy regularization, showing that an exact critic acts as a strong variance reducer and enables sample complexity comparable to that of deterministic policy gradient, and that these benefits are preserved when the learned critic is sufficiently accurate.
This paper presents a unified theoretical framework for stochastic variance-reduced estimation, deriving high-probability bounds via a new Freedman inequality and improving oracle complexities for constrained optimization.
This paper studies nonconvex stochastic optimization under Blum-Gladyshev noise, where gradient variance grows with distance from initialization. It proves convergence guarantees for normalized SGD with momentum and a variance-reduced STORM method, achieving minimax optimal rates under certain conditions.
This paper identifies vulnerabilities in the AIVAT variance reduction technique when the heuristic value function is not fixed prior to evaluation, and shows how to propagate heuristic uncertainty to further reduce variance, achieving a 43% reduction in the number of samples needed for statistical conclusions.
This paper introduces Path-Coupled Bellman Flows (PCBF), a continuous-time distributional reinforcement learning method that uses flow matching to model return distributions without heuristic projections. It addresses boundary mismatch and high-variance issues in previous flow-based approaches by coupling current and successor return flows through shared base noise.
Proposes vOPD, which stabilizes on-policy distillation for LLMs by introducing a control variate baseline from reinforcement learning, achieving performance comparable to expensive full-vocabulary methods at lower computational cost.
This paper introduces POISE, a method that stabilizes policy optimization in large reasoning models by estimating baselines from the model's own internal states, reducing computational overhead compared to PPO and GRPO.
OpenAI researchers derive a bias-free action-dependent baseline for variance reduction in policy gradient methods, demonstrating improved learning efficiency on high-dimensional control tasks and in multi-agent and partially observed environments.