When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Hugging Face Daily Papers 06/16/26, 12:00 AM Papers

Summary

This paper develops a statistical theory for offline reinforcement learning from trajectory-level outcome supervision, proposing the OPAC algorithm and characterizing when such supervision enables efficient learning versus when fundamental barriers arise.

Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order Õ(H^2 C_{sa}(π^⋆)/n) and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require Ω(2^H) trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, κ_μ(σ) and χ_μ(σ), capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
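To make the canonical label model concrete, here is a minimal sketch of recovering per-step rewards when each trajectory carries only one noisy scalar label whose mean is the cumulative return. It is not OPAC itself: the tabular setting, the ridge-regression estimator, the uniformly random data collection, and the names `fit_latent_rewards` and `demo` are all assumptions made for illustration.

```python
import numpy as np

def fit_latent_rewards(trajectories, labels, n_states, n_actions, ridge=1e-3):
    """Ridge-regress one scalar label per trajectory onto state-action visit counts.

    Assumption: the label's conditional mean is the sum of a tabular latent
    reward r(s, a) over the visited (s, a) pairs, so the label is linear in
    the trajectory's visit-count vector.
    """
    d = n_states * n_actions
    X = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * n_actions + a] += 1.0  # count visits to (s, a)
    y = np.asarray(labels, dtype=float)
    theta = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    return theta.reshape(n_states, n_actions)

def demo(seed=0, n=2000, H=5, n_states=3, n_actions=2, noise=0.1):
    """Toy data: (s, a) pairs drawn uniformly at each step, Gaussian label noise."""
    rng = np.random.default_rng(seed)
    true_r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    trajectories, labels = [], []
    for _ in range(n):
        traj = [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
                for _ in range(H)]
        cumulative_return = sum(true_r[s, a] for s, a in traj)
        trajectories.append(traj)
        labels.append(cumulative_return + noise * rng.normal())
    est = fit_latent_rewards(trajectories, labels, n_states, n_actions)
    return true_r, est

if __name__ == "__main__":
    true_r, est = demo()
    print("max abs error:", np.max(np.abs(true_r - est)))
```

In this toy setup the visit counts vary enough across trajectories that the per-step rewards are identifiable from trajectory-level labels alone. With narrower data coverage the regression becomes ill-conditioned, which is the kind of effect a concentrability coefficient is meant to quantify.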

Cached at: 06/18/26, 07:58 PM

Source: https://huggingface.co/papers/2606.18531

Abstract

Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems.
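As a rough illustration of what pessimism means in an offline setting, the sketch below chooses an action by a lower confidence bound rather than by the empirical mean. The single-state setup, the count-based penalty `beta / sqrt(count)`, and the function name `pessimistic_choice` are assumptions for illustration only and are not taken from the paper's construction.

```python
import numpy as np

def pessimistic_choice(reward_sum, counts, beta=1.0):
    """Return the action maximizing (empirical mean - beta / sqrt(count)).

    Actions never seen in the dataset get a lower bound of -inf, so they
    are never selected.
    """
    reward_sum = np.asarray(reward_sum, dtype=float)
    counts = np.asarray(counts, dtype=float)
    safe_counts = np.maximum(counts, 1.0)      # avoid division by zero
    mean = reward_sum / safe_counts
    penalty = beta / np.sqrt(safe_counts)      # larger for rarely seen actions
    lcb = np.where(counts > 0, mean - penalty, -np.inf)
    return int(np.argmax(lcb)), lcb

if __name__ == "__main__":
    # Action 0: one lucky sample with reward 0.9.
    # Action 1: 100 samples averaging 0.6.
    action, lcb = pessimistic_choice([0.9, 60.0], [1, 100])
    print(action, lcb)
```

Here the empirical mean favors action 0, but its single sample gives a large penalty, so the pessimistic rule selects the well-covered action 1.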



Get this paper in your agent:

hf papers read 2606.18531

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Offline Preference-Based Trajectory Evaluation

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
