World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty
December 5, 2025
Authors: Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar
cs.AI
Abstract
Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation, where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate, generating future video frames that are misaligned with physical reality, which raises serious concerns in tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method that trains continuous-scale calibrated controllable video models to produce dense confidence estimates at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations that empower video models to estimate their own uncertainty. First, it develops a novel framework that trains video models for both correctness and calibration via strictly proper scoring rules. Second, it estimates the video model's uncertainty in latent space, avoiding the training instability and prohibitive training costs associated with pixel-space approaches. Third, it maps the dense latent-space uncertainty to interpretable pixel-level uncertainty in RGB space, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.
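The abstract names three mechanisms: training with a strictly proper scoring rule so the model is rewarded for both correctness and calibration, estimating uncertainty over latent tokens rather than pixels, and mapping that latent uncertainty to an RGB-space heatmap. The sketch below illustrates how such pieces could fit together; it is not the authors' implementation, and the function names, tensor shapes, and the choice of the Gaussian negative log-likelihood as the strictly proper scoring rule are illustrative assumptions.

```python
# Minimal sketch (hypothetical names, not the paper's code): per-latent-token
# mean/variance prediction trained with a Gaussian NLL (a strictly proper
# scoring rule), with the latent-space variance upsampled to a pixel heatmap.
import torch
import torch.nn.functional as F

def gaussian_nll_loss(pred_mean, pred_logvar, target_latents):
    """Gaussian negative log-likelihood over latent tokens (up to a constant).
    Being strictly proper, it is minimized only when the predicted distribution
    matches the target distribution, encouraging correctness and calibration."""
    return 0.5 * (pred_logvar
                  + (target_latents - pred_mean) ** 2 / pred_logvar.exp()).mean()

def pixel_uncertainty_heatmap(pred_logvar, frame_hw):
    """Map dense latent-space uncertainty to an interpretable pixel-space map.
    pred_logvar: (B, C, h, w) per-subpatch log-variance on the latent grid.
    frame_hw:    (H, W) of the decoded RGB frame."""
    var = pred_logvar.exp().mean(dim=1, keepdim=True)            # aggregate channels
    heat = F.interpolate(var, size=frame_hw, mode="bilinear",
                         align_corners=False)                     # latent grid -> pixels
    heat = (heat - heat.amin()) / (heat.amax() - heat.amin() + 1e-8)
    return heat                                                   # (B, 1, H, W) in [0, 1]

# Hypothetical usage with a world model that conditions on past latents and an
# action, and predicts per-token mean / log-variance for the next frame:
#   pred_mean, pred_logvar = model(past_latents, action)
#   loss = gaussian_nll_loss(pred_mean, pred_logvar, next_latents)
#   heatmap = pixel_uncertainty_heatmap(pred_logvar, frame_hw=(256, 256))
```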