When Does Trajectory-Level Supervision Enable Efficient Offline Reinforcement Learning?

Abstract

Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order Õ(H^2 C_{sa}(π^⋆)/n) and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require Ω(2^H) trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, κ_μ(σ) and χ_μ(σ), which capture information loss in outcome aggregation and in generalized Bellman updates, and under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
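To make the canonical setting concrete, the following is a minimal illustrative sketch of learning a latent per-step reward model when each trajectory carries only one scalar label whose conditional mean is the cumulative return. It is not OPAC: it has no pessimism and no actor-critic loop, and it swaps in plain ridge least squares on state-action visit counts. The tabular setup, the uniformly sampled state-action pairs, and the names `fit_latent_rewards` and `simulate` are all assumptions made for illustration.

```python
import numpy as np

def fit_latent_rewards(trajectories, labels, n_states, n_actions, ridge=1e-3):
    """Ridge least-squares estimate of per-(state, action) rewards
    from one scalar label per trajectory (hypothetical helper)."""
    n = len(trajectories)
    # Each row counts how often every (s, a) pair occurs in one trajectory.
    X = np.zeros((n, n_states * n_actions))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            X[i, s * n_actions + a] += 1.0
    y = np.asarray(labels, dtype=float)
    # The label's conditional mean is the sum of rewards, i.e. X @ theta.
    A = X.T @ X + ridge * np.eye(X.shape[1])
    theta = np.linalg.solve(A, X.T @ y)
    return theta.reshape(n_states, n_actions)

def simulate(n, horizon, n_states, n_actions, true_r, noise_std, rng):
    """Toy data: (s, a) pairs drawn uniformly at random (an assumption,
    not a real MDP), labelled with a noisy cumulative return."""
    trajectories, labels = [], []
    for _ in range(n):
        traj = [(int(rng.integers(n_states)), int(rng.integers(n_actions)))
                for _ in range(horizon)]
        ret = sum(true_r[s, a] for s, a in traj)
        trajectories.append(traj)
        labels.append(ret + noise_std * rng.normal())
    return trajectories, labels

rng = np.random.default_rng(0)
true_r = rng.uniform(0.0, 1.0, size=(4, 2))   # hidden per-step rewards
trajs, ys = simulate(2000, 5, 4, 2, true_r, 0.1, rng)
r_hat = fit_latent_rewards(trajs, ys, 4, 2)
max_err = float(np.max(np.abs(r_hat - true_r)))
print("max abs reward error:", max_err)
```

In this toy case the per-step rewards are identifiable because the visit-count design matrix has full column rank. When the offline data covers some state-action pairs poorly, the estimate degrades, which is the kind of coverage dependence that the concentrability coefficient in the abstract quantifies.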