Tag
This paper diagnoses systematic errors in attribution patching, a gradient-based approximation used for causal localization in language models, and proposes a second-order correction using Hessian-vector products that improves reliability with minimal additional computational cost.
This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning, decomposing the error into negative and positive parts and providing bounds that reveal an asymmetry related to overestimation.
This paper analyzes the 'training in imagination' paradigm in model-based reinforcement learning, deriving optimal sample allocation strategies and characterizing how dynamics and reward model errors affect policy returns.