When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
June 16, 2026
Authors: Xuanfei Ren, Tengyang Xie
cs.AI
Abstract
Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets
record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level
supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory
provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm
that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order
\(\widetilde{O}\left(\frac{H^2 C_{sa(\pi^\star)}}{n}\right)\) and a matching lower bound, characterizing the sharp statistical cost of replacing
process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the
leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline
RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step
rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require Ω(2^H)
trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two
structural coefficients, κ_μ(σ) and χ_μ(σ), capturing information loss in outcome aggregation and
generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when
outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental
statistical barriers.
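
To make the canonical setting concrete, the following is a minimal sketch of the reward-learning idea only: recovering latent per-step rewards from scalar trajectory labels whose conditional mean is the cumulative return. It is not OPAC. It omits pessimism and the actor-critic policy optimization, and it assumes a tabular problem in which the trajectory label is linear in state-action visit counts. The function names (`solve_linear`, `fit_latent_rewards`, `simulate`), the ridge parameter, and the synthetic data (state-action pairs drawn uniformly at random rather than from real dynamics) are illustrative assumptions, not taken from the paper.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_latent_rewards(trajectories, labels, n_states, n_actions, ridge=1e-3):
    """Ridge regression of trajectory labels on state-action visit counts.

    Assumption: each label is a noisy sum of per-step rewards, so the
    regression coefficients estimate the latent reward table.
    """
    d = n_states * n_actions
    gram = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    moment = [0.0] * d
    for traj, y in zip(trajectories, labels):
        counts = [0.0] * d
        for s, a in traj:
            counts[s * n_actions + a] += 1.0
        for i in range(d):
            moment[i] += counts[i] * y
            for j in range(d):
                gram[i][j] += counts[i] * counts[j]
    theta = solve_linear(gram, moment)
    return [theta[s * n_actions:(s + 1) * n_actions] for s in range(n_states)]

def simulate(n, horizon, true_r, rng, noise=0.1):
    """Synthetic offline data: uniform (state, action) pairs, one label per trajectory."""
    n_states, n_actions = len(true_r), len(true_r[0])
    trajs, labels = [], []
    for _ in range(n):
        traj = [(rng.randrange(n_states), rng.randrange(n_actions))
                for _ in range(horizon)]
        ret = sum(true_r[s][a] for s, a in traj)
        trajs.append(traj)
        labels.append(ret + rng.gauss(0.0, noise))  # only the noisy return is observed
    return trajs, labels

rng = random.Random(0)
true_r = [[rng.random() for _ in range(2)] for _ in range(3)]
trajs, labels = simulate(2000, 5, true_r, rng)
r_hat = fit_latent_rewards(trajs, labels, 3, 2)
max_err = max(abs(r_hat[s][a] - true_r[s][a]) for s in range(3) for a in range(2))
print(f"max abs error in recovered per-step rewards: {max_err:.4f}")
```

The sketch recovers the reward table here because the sampled trajectories cover every state-action pair with varying visit counts. When coverage is poor, the regression is ill-conditioned, which is the kind of dependence the concentrability term in the bound above is meant to capture.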