Tag
This paper proposes STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method for faster off-policy prediction in reinforcement learning. It replaces the covariance metric with the behavior-policy Bellman matrix and provides convergence analysis and experimental comparisons.
This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning, decomposing the error into negative and positive parts and providing bounds that reveal an asymmetry related to overestimation.
This paper presents a switching-system theory for Q-learning with linear function approximation, using the joint spectral radius to analyze convergence and stability under deterministic, i.i.d., and Markovian observations.
This paper addresses an open problem in average-reward reinforcement learning by providing a counterexample showing that differential temporal-difference learning can diverge under a global clock even though it converges under a local clock.