StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
May 29, 2026
Authors: Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy
cs.AI
Abstract
Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
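The two-objective noise optimization described above can be illustrated with a toy sketch. This is not the authors' implementation: the world model and the Vision-Language Model are replaced by a hypothetical linear scorer (`semantic_score`), and the plausibility term is assumed here to be a penalty that keeps the noise norm close to sqrt(d), the typical norm of a d-dimensional standard Gaussian sample. The names `steer_noise`, `lam`, `lr`, and `event_direction` are illustrative only.

```python
import math
import random


def semantic_score(z, w):
    """Toy stand-in (assumption) for a VLM-based score of the generated video."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    return sum(wi * zi for wi, zi in zip(w, z)) / w_norm


def steer_noise(z0, w, lam=1.0, lr=0.05, steps=500):
    """Gradient ascent on semantic_score(z) - lam * (||z|| - sqrt(d))**2."""
    d = len(z0)
    target_radius = math.sqrt(d)  # typical norm of a standard Gaussian sample
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    z = list(z0)
    for _ in range(steps):
        z_norm = max(math.sqrt(sum(zi * zi for zi in z)), 1e-12)
        # Gradient of the plausibility penalty is radial: 2*lam*(||z|| - R) * z/||z||.
        radial = 2.0 * lam * (z_norm - target_radius) / z_norm
        # Ascend the semantic score while descending the penalty.
        z = [zi + lr * (wi / w_norm - radial * zi) for zi, wi in zip(z, w)]
    return z


random.seed(0)
dim = 16
z_init = [random.gauss(0.0, 1.0) for _ in range(dim)]
event_direction = [random.gauss(0.0, 1.0) for _ in range(dim)]  # hypothetical target event
z_steered = steer_noise(z_init, event_direction)

print("score before:", semantic_score(z_init, event_direction))
print("score after: ", semantic_score(z_steered, event_direction))
print("noise norm after:", math.sqrt(sum(zi * zi for zi in z_steered)))
```

In this toy setting the penalty weight `lam` trades off how strongly the noise is pushed toward the target event against how far its norm may drift from the Gaussian shell. In the actual method, the gradient of the semantic term would come from differentiating through the video world model and the VLM rather than from a fixed direction.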