When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Abstract

Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde{O}(H^2 C_{sa}(\pi^\star)/n)$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require $\Omega(2^H)$ trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
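
To make the canonical setting concrete, the following minimal sketch fits a latent per-step reward table from trajectory-level labels alone. It is not OPAC: it omits pessimism and the actor-critic loop entirely, and it assumes a small tabular problem with a time-homogeneous reward, uniformly random state-action visits, and Gaussian label noise. The function `fit_latent_rewards` and every constant below are hypothetical choices made for illustration.

```python
import numpy as np


def fit_latent_rewards(trajectories, labels, n_states, n_actions, ridge=1e-3):
    """Ridge-regress trajectory labels on state-action visit counts.

    Assumption (illustrative only): each label's conditional mean is the sum
    of a time-homogeneous latent reward r(s, a) along the trajectory.
    """
    d = n_states * n_actions
    counts = np.zeros((len(trajectories), d))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            counts[i, s * n_actions + a] += 1.0
    gram = counts.T @ counts + ridge * np.eye(d)
    theta = np.linalg.solve(gram, counts.T @ np.asarray(labels, dtype=float))
    return theta.reshape(n_states, n_actions)


# Synthetic offline dataset: only one scalar label per trajectory is observed.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, n = 3, 2, 4, 2000
true_r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

trajectories = [
    [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
     for _ in range(horizon)]
    for _ in range(n)
]
labels = [
    sum(true_r[s, a] for s, a in traj) + rng.normal(0.0, 0.1)
    for traj in trajectories
]

r_hat = fit_latent_rewards(trajectories, labels, n_states, n_actions)
max_err = float(np.max(np.abs(r_hat - true_r)))
print(f"max abs error in recovered per-step rewards: {max_err:.4f}")
```

In this toy setup the per-step rewards are recoverable because the data visit every state-action pair in varied combinations. When the data cover only a narrow set of state-action pairs, the regression becomes ill-conditioned, which is the kind of coverage dependence that a concentrability coefficient is meant to quantify.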