Recovering Hidden Reward in Diffusion-Based Policies

Abstract

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that, under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, which enables reward extraction without adversarial training. We prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of the recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks, and its recovered reward signal, when used for downstream reinforcement learning, outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction also serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
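The energy parameterization described in the abstract can be illustrated with a minimal JAX sketch. It is not taken from the EnergyFlow repository: the two-layer MLP, the single fixed noise level `sigma`, the sigma-squared loss weighting, and the names `init_params`, `energy`, `score`, and `dsm_loss` are all assumptions made for illustration. The sketch shows only the structural idea, namely that the denoising field is obtained as the negative action-gradient of one scalar network, so it is conservative by construction, and that it is trained with a standard denoising score matching objective.

```python
import jax
import jax.numpy as jnp


def init_params(key, state_dim, action_dim, hidden=32):
    # Hypothetical two-layer MLP for a scalar energy E_theta(s, a, sigma).
    k1, k2 = jax.random.split(key)
    in_dim = state_dim + action_dim + 1
    return {
        "W1": jax.random.normal(k1, (in_dim, hidden)) / jnp.sqrt(in_dim),
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden,)) / jnp.sqrt(hidden),
    }


def energy(params, state, action, sigma):
    # Scalar energy for one (state, noisy action, noise level) triple.
    x = jnp.concatenate([state, action, jnp.atleast_1d(sigma)])
    h = jnp.tanh(x @ params["W1"] + params["b1"])
    return h @ params["w2"]


def score(params, state, action, sigma):
    # Conservative field: the score is the negative gradient of the
    # scalar energy with respect to the action.
    return -jax.grad(energy, argnums=2)(params, state, action, sigma)


def dsm_loss(params, key, states, actions, sigma):
    # Denoising score matching at a single noise level (an assumption;
    # practical schedules use many levels). Expert actions are perturbed
    # with Gaussian noise and the score is regressed onto -eps / sigma,
    # the score of the Gaussian perturbation kernel.
    eps = jax.random.normal(key, actions.shape)
    noisy = actions + sigma * eps
    pred = jax.vmap(score, in_axes=(None, 0, 0, None))(params, states, noisy, sigma)
    target = -eps / sigma
    # sigma^2 weighting keeps the loss scale comparable across noise levels.
    return jnp.mean(jnp.sum((sigma * (pred - target)) ** 2, axis=-1))


# Usage with random stand-in data (not real expert demonstrations).
key = jax.random.PRNGKey(0)
k_init, k_s, k_a, k_n = jax.random.split(key, 4)
params = init_params(k_init, state_dim=3, action_dim=2)
states = jax.random.normal(k_s, (8, 3))
actions = jax.random.normal(k_a, (8, 2))
loss, grads = jax.value_and_grad(dsm_loss)(params, k_n, states, actions, 0.1)
```

Under the abstract's maximum-entropy claim, the negative energy at low noise would play the role of the expert's soft Q-function, presumably up to a temperature scale and an action-independent offset. How EnergyFlow actually turns this quantity into a reward is not specified in the abstract, so the sketch stops at the score and its training loss.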
