StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

May 29, 2026
Authors: Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy
cs.AI

Abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
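
To make the idea of steering initial noise under two objectives concrete, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not specify either objective in detail, so both are stand-ins chosen for illustration. The hypothetical function `steer_noise` uses a linear score `target_dir @ z` in place of the VLM-based semantic objective, and a penalty that keeps the noise norm near sqrt(d), where Gaussian noise in d dimensions concentrates, in place of the plausibility objective. No world model or video generation is involved.

```python
import numpy as np


def steer_noise(z0, target_dir, steps=500, lr=0.05, lam=1.0):
    """Toy gradient ascent on an initial noise vector z (illustrative only).

    Semantic stand-in:      s(z) = target_dir . z
    Plausibility stand-in:  p(z) = -(||z|| - sqrt(d))^2
    The update maximizes s(z) + lam * p(z).
    """
    z = z0.astype(float).copy()
    shell_radius = np.sqrt(z.size)  # typical norm of a standard Gaussian sample
    for _ in range(steps):
        norm = np.linalg.norm(z)
        grad_semantic = target_dir  # gradient of the linear score
        grad_plausible = -2.0 * (norm - shell_radius) * z / norm
        z += lr * (grad_semantic + lam * grad_plausible)
    return z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    z_init = rng.standard_normal(d)
    direction = np.zeros(d)
    direction[0] = 1.0  # hypothetical "target event" direction

    z_steered = steer_noise(z_init, direction, lam=1.0)
    z_unconstrained = steer_noise(z_init, direction, lam=0.0)

    print("score before:", direction @ z_init)
    print("score after (with plausibility term):", direction @ z_steered)
    print("norm with plausibility term:", np.linalg.norm(z_steered))
    print("norm without plausibility term:", np.linalg.norm(z_unconstrained))
```

In this toy setting, setting `lam=0.0` lets the noise norm grow far beyond sqrt(d), which loosely mirrors the out-of-distribution drift the abstract describes, while a positive `lam` raises the score and keeps the noise near the region a Gaussian sampler would produce.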