StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
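The two-objective noise optimization described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the method's actual implementation: `semantic_score` is a hypothetical linear stand-in for the Vision-Language Model's score of the target event, `plausibility_penalty` is one possible regularizer that keeps the noise near the norm typical of a standard Gaussian sample, and `steer_noise`, `lam`, `lr`, and `steps` are invented names and values. No video world model is involved; the sketch only shows how a semantic term and a plausibility term can be traded off while optimizing an initial noise vector by gradient descent.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def semantic_score(z, w):
    """Hypothetical stand-in for a VLM score of the target event (higher is better)."""
    return dot(w, z)


def plausibility_penalty(z):
    """Assumed regularizer: keeps ||z||^2 / d near 1, as expected for z ~ N(0, I)."""
    return (dot(z, z) / len(z) - 1.0) ** 2


def steer_noise(z0, w, lam=50.0, lr=0.05, steps=1000):
    """Gradient descent on -semantic_score(z, w) + lam * plausibility_penalty(z)."""
    z = list(z0)
    d = len(z)
    for _ in range(steps):
        # Gradient of the penalty is 4 * (||z||^2 / d - 1) * z / d.
        radial = 4.0 * lam * (dot(z, z) / d - 1.0) / d
        # Gradient of the negated semantic score is -w.
        z = [zi - lr * (-wi + radial * zi) for zi, wi in zip(z, w)]
    return z


random.seed(0)
d = 64

# Unit-norm direction that the toy "semantic" objective rewards.
w = [random.gauss(0.0, 1.0) for _ in range(d)]
norm_w = math.sqrt(dot(w, w))
w = [wi / norm_w for wi in w]

# Nominal initial noise and its steered counterpart.
z0 = [random.gauss(0.0, 1.0) for _ in range(d)]
z_opt = steer_noise(z0, w)

print("semantic score before:", semantic_score(z0, w))
print("semantic score after: ", semantic_score(z_opt, w))
print("||z||^2 / d after:    ", dot(z_opt, z_opt) / d)
```

In this toy setting the semantic term alone would let the noise norm grow without bound, which is a simple analogue of drifting out of distribution; the penalty term confines the optimization to the shell where Gaussian samples concentrate, so the noise rotates toward the rewarded direction rather than growing.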