Diffusion World Model

Abstract

We introduce the Diffusion World Model (DWM), a conditional diffusion model that predicts multistep future states and rewards concurrently. Unlike traditional one-step dynamics models, DWM produces long-horizon predictions in a single forward pass, eliminating the need for recursive queries. We integrate DWM into model-based value estimation, where the short-term return is simulated using future trajectories sampled from DWM. In the context of offline reinforcement learning, DWM can be viewed as a form of conservative value regularization through generative modeling. Alternatively, it can be seen as a data source that enables offline Q-learning with synthetic data. Our experiments on the D4RL dataset confirm that DWM is robust to long-horizon simulation. In terms of absolute performance, DWM significantly surpasses one-step dynamics models, with a 44% performance gain, and achieves state-of-the-art performance.
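To make the value-estimation idea concrete, here is a minimal sketch of how a short-term return simulated from sampled future rewards could be combined with a terminal value estimate. The abstract does not state the exact target, so the H-step form below (discounted sum of predicted rewards plus a discounted terminal value) is an assumption. The names `sample_future` and `h_step_target` are hypothetical, and `sample_future` is only a random-walk stand-in for a trained diffusion sampler. It is there to illustrate the interface: one call returns the whole horizon, with no recursive one-step queries.

```python
import numpy as np


def sample_future(state, horizon, rng):
    """Hypothetical stand-in for a multistep world model sampler.

    Returns all `horizon` future states and rewards from a single call.
    The dynamics here are a toy random walk, NOT a diffusion model.
    """
    state = np.asarray(state, dtype=np.float64)
    batch, dim = state.shape
    noise = rng.normal(scale=0.1, size=(batch, horizon, dim))
    # Shape (batch, horizon, dim): the full predicted trajectory at once.
    future_states = state[:, None, :] + np.cumsum(noise, axis=1)
    # Toy reward: negative distance from the origin, shape (batch, horizon).
    rewards = -np.linalg.norm(future_states, axis=-1)
    return future_states, rewards


def h_step_target(rewards, terminal_value, gamma=0.99):
    """Assumed H-step value target.

    target = sum_{t=0}^{H-1} gamma^t * r_t + gamma^H * terminal_value
    rewards: (batch, H) predicted rewards; terminal_value: (batch,).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    terminal_value = np.asarray(terminal_value, dtype=np.float64)
    horizon = rewards.shape[1]
    discounts = gamma ** np.arange(horizon)       # [1, gamma, ..., gamma^(H-1)]
    short_term_return = rewards @ discounts       # simulated short-term return
    return short_term_return + gamma ** horizon * terminal_value


rng = np.random.default_rng(0)
states = np.zeros((4, 3))                         # batch of 4 toy states
future_states, rewards = sample_future(states, horizon=8, rng=rng)
terminal_value = np.zeros(4)                      # placeholder for a learned value estimate
targets = h_step_target(rewards, terminal_value, gamma=0.99)
```

In an actual offline RL setup the terminal value would come from a learned critic evaluated at the last predicted state; a zero placeholder is used here only to keep the sketch self-contained.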