When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
Summary
This paper develops a statistical theory for offline reinforcement learning from trajectory-level outcome supervision, proposing the OPAC algorithm and characterizing when such supervision enables efficient learning versus when fundamental barriers arise.
Cached at: 06/18/26, 07:58 PM
Source: https://huggingface.co/papers/2606.18531
Abstract
Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems.
Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde{O}(H^2 C_{sa}(\pi^\star)/n)$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require Ω(2^H) trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, κ_μ(σ) and χ_μ(σ), capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
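The canonical setting in the abstract, where each trajectory carries one scalar label whose conditional mean is the cumulative return, can be illustrated with a small sketch. This is not OPAC: it omits pessimism and policy optimization entirely, and only shows how per-step rewards might be recovered from trajectory-level labels in a toy tabular case. The function names (`fit_latent_reward`, `simulate`), the ridge penalty, the uniform sampling of state-action pairs, and the Gaussian label noise are all assumptions made for illustration.

```python
import numpy as np

def fit_latent_reward(trajectories, labels, n_states, n_actions, ridge=1e-3):
    """Ridge regression of trajectory labels on state-action visit counts.

    Hypothetical helper: the summed per-step reward is linear in the visit
    counts, so a latent tabular reward can be estimated from one label per
    trajectory.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * n_actions + a] += 1.0  # count visits to (s, a)
    y = np.asarray(labels, dtype=float)
    theta = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    return theta.reshape(n_states, n_actions)

def simulate(n, horizon, true_reward, rng, noise=0.1):
    """Toy data: uniformly random (state, action) pairs, not a real MDP."""
    n_states, n_actions = true_reward.shape
    trajectories, labels = [], []
    for _ in range(n):
        states = rng.integers(n_states, size=horizon)
        actions = rng.integers(n_actions, size=horizon)
        traj = list(zip(states.tolist(), actions.tolist()))
        ret = sum(true_reward[s, a] for s, a in traj)
        trajectories.append(traj)
        # Only a noisy scalar with conditional mean equal to the return is kept.
        labels.append(ret + noise * rng.normal())
    return trajectories, labels

rng = np.random.default_rng(0)
true_reward = rng.uniform(size=(3, 2))
trajs, labels = simulate(2000, 5, true_reward, rng)
r_hat = fit_latent_reward(trajs, labels, 3, 2)
max_err = float(np.abs(r_hat - true_reward).max())
```

In this toy the visit counts cover every state-action pair evenly, so the regression is well conditioned. Under poor coverage the estimate degrades in the under-visited directions, which is where the concentrability dependence described in the abstract would come into play.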
Similar Articles
Offline Preference-Based Trajectory Evaluation
This paper proposes offline preference-based trajectory evaluation for agentic systems, which compares trajectories via temporal preferences rather than binary success metrics. It shows that this approach reduces ties from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
Introduces IOP, a framework that internalizes outcome supervision into process supervision for reasoning reinforcement learning, enabling fine-grained credit assignment without external annotations.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA proposes strategic trajectory abstraction for long-horizon LLM agents, using hierarchical GRPO-style rollout with diverse strategy sampling and critical self-judgment to improve sample efficiency and final performance over frontier models and prior RL baselines.
A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning
This academic paper develops a theoretical framework for online learning with autoregressive chain-of-thought reasoning, analyzing mistake bounds under end-to-end and trajectory supervision models.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
This paper introduces Retention-aware Policy Optimization (RaPO) to mitigate catastrophic forgetting in visual continual learning using reinforcement fine-tuning. RaPO uses trajectory-level reward shaping and cross-task advantage normalization to close the gap between reinforcement and supervised fine-tuning in class- and domain-incremental learning.