When Does Trajectory-Level Supervision Enable Efficient Offline Reinforcement Learning?

Abstract

Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde O(H^2 C_{s,a}(\pi^\star)/n)$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require $\Omega(2^H)$ trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, $\kappa_\mu(\sigma)$ and $\chi_\mu(\sigma)$, capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
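To make the canonical setting concrete, the following is a minimal sketch of recovering latent per-step rewards when each trajectory carries only one scalar label equal to its cumulative return. It is not OPAC: it omits pessimism and the actor-critic loop, and it assumes a tabular reward, noise-free labels, and state-action pairs drawn uniformly at random rather than from a real MDP. All names (`visit_counts`, `fit_latent_rewards`, `true_reward`) are hypothetical and chosen only for illustration.

```python
import random

def visit_counts(trajectory, n_states, n_actions):
    """Count how often each (state, action) pair appears in one trajectory."""
    counts = [0.0] * (n_states * n_actions)
    for s, a in trajectory:
        counts[s * n_actions + a] += 1.0
    return counts

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_latent_rewards(trajectories, labels, n_states, n_actions, ridge=1e-6):
    """Ridge least squares: find r so that sum_h r(s_h, a_h) matches each label."""
    d = n_states * n_actions
    # Normal equations (Phi^T Phi + ridge * I) r = Phi^T y, built incrementally.
    A = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for traj, y in zip(trajectories, labels):
        phi = visit_counts(traj, n_states, n_actions)
        for i in range(d):
            b[i] += phi[i] * y
            for j in range(d):
                A[i][j] += phi[i] * phi[j]
    return solve_linear_system(A, b)

# Toy data (assumption): 3 states, 2 actions, horizon 5, uniform random visits.
rng = random.Random(0)
n_states, n_actions, horizon = 3, 2, 5
true_reward = [0.1, 0.9, 0.5, 0.2, 0.7, 0.3]  # hidden per-step rewards
trajectories = [
    [(rng.randrange(n_states), rng.randrange(n_actions)) for _ in range(horizon)]
    for _ in range(200)
]
# Each trajectory exposes only its cumulative return, never the per-step rewards.
labels = [sum(true_reward[s * n_actions + a] for s, a in traj) for traj in trajectories]

estimate = fit_latent_rewards(trajectories, labels, n_states, n_actions)
max_error = max(abs(e - t) for e, t in zip(estimate, true_reward))
```

In this toy case the per-step rewards are identifiable because the visit-count vectors span the full reward space. When the offline data covers the relevant state-action pairs poorly, the regression is ill-conditioned, which is the kind of coverage dependence that a concentrability coefficient is meant to quantify.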