StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
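
The interplay of the two objectives can be illustrated with a deliberately tiny sketch. It is not the StressDream implementation: there is no diffusion world model or Vision-Language Model here. As assumptions made purely for illustration, the semantic objective is replaced by a linear score along a hypothetical direction `w`, and the plausibility objective is replaced by a penalty that keeps the squared norm of the noise near its dimension `d`, which is where samples from a standard Gaussian concentrate. The function names and the hyperparameters `lam`, `lr`, and `steps` are likewise hypothetical.

```python
import math
import random


def semantic_score(z, w):
    """Toy stand-in for the semantic objective (assumption: a linear score).
    Higher means the hypothetical target event is expressed more strongly."""
    return sum(zi * wi for zi, wi in zip(z, w))


def plausibility_penalty(z):
    """Toy stand-in for the plausibility objective (assumption): penalize
    noise whose squared norm leaves the shell ||z||^2 ~ d typical of N(0, I)."""
    d = len(z)
    sq_norm = sum(zi * zi for zi in z)
    return (sq_norm - d) ** 2 / (4.0 * d)


def steer_noise(z0, w, lam=5.0, lr=0.02, steps=2000):
    """Gradient ascent on semantic_score(z, w) - lam * plausibility_penalty(z)."""
    z = list(z0)
    d = len(z)
    for _ in range(steps):
        sq_norm = sum(zi * zi for zi in z)
        # Gradient of the score is w; gradient of the penalty is
        # (||z||^2 - d) * z / d. Both are derived by hand for this toy.
        z = [zi + lr * (wi - lam * (sq_norm - d) * zi / d)
             for zi, wi in zip(z, w)]
    return z


def make_toy_problem(seed=0, d=16):
    """Sample an initial noise vector and a unit-norm target direction."""
    rng = random.Random(seed)
    z0 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    raw = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(r * r for r in raw))
    return z0, [r / norm for r in raw]


if __name__ == "__main__":
    z0, w = make_toy_problem()
    steered = steer_noise(z0, w)                 # both objectives
    unconstrained = steer_noise(z0, w, lam=0.0)  # semantic objective only
    for name, z in [("initial", z0), ("steered", steered),
                    ("unconstrained", unconstrained)]:
        print(name, "score:", round(semantic_score(z, w), 3),
              "squared norm:", round(sum(v * v for v in z), 3))
```

In this toy, dropping the penalty (`lam=0.0`) lets the score grow without bound while the squared norm moves far from `d`, which mirrors the drift toward OOD noise that the abstract warns about. With the penalty, the score still rises but the noise stays close to the Gaussian shell.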