
World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty

December 5, 2025
Authors: Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar
cs.AI

Abstract

Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, particularly in controllable video generation, where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling for robotics. Despite these capabilities, controllable video models often hallucinate: they generate future video frames that are misaligned with physical reality, which raises serious concerns in tasks such as robot policy evaluation and planning. Yet state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale, calibrated, controllable video models that produce dense confidence estimates at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations that empower video models to estimate their own uncertainty. First, we develop a framework that trains video models for both correctness and calibration via strictly proper scoring rules. Second, we estimate the video model's uncertainty in latent space, avoiding the training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in RGB space, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution but also enables effective out-of-distribution detection.
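
The abstract names the mechanisms but gives no implementation details, so the following is a minimal sketch of the first idea: training an uncertainty head in latent space against a strictly proper scoring rule. The Gaussian negative log-likelihood stands in here for the paper's (unspecified) scoring rule, and `UncertaintyHead`, the latent shapes, and the toy training step are illustrative assumptions, not the actual C3 architecture.

```python
# Sketch: calibrated latent-space uncertainty via a strictly proper scoring
# rule (Gaussian NLL). Names and shapes are assumptions, not the C3 design.
import torch
import torch.nn as nn


class UncertaintyHead(nn.Module):
    """Predicts a per-element mean and variance over the next latent frame."""

    def __init__(self, dim: int):
        super().__init__()
        self.mean = nn.Linear(dim, dim)
        # Predict log-variance for numerical stability; exp() keeps it positive.
        self.log_var = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        return self.mean(h), self.log_var(h).exp()


def proper_scoring_loss(mean, var, target):
    """Gaussian NLL is a strictly proper scoring rule: its expectation is
    minimized only when (mean, var) match the true predictive distribution,
    so optimizing it trains for correctness and calibration jointly."""
    nll = nn.GaussianNLLLoss(reduction="mean")
    return nll(mean, target, var)


# Toy usage: h is the backbone's hidden state for one latent frame,
# z_next is the ground-truth next latent from the video tokenizer/VAE.
dim = 256
head = UncertaintyHead(dim)
h = torch.randn(8, dim)        # (batch, latent dim) -- assumed shapes
z_next = torch.randn(8, dim)
mean, var = head(h)
loss = proper_scoring_loss(mean, var, z_next)
loss.backward()
```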
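Likewise, a hedged sketch of the third idea: projecting per-latent-location uncertainty up to a pixel-resolution heatmap, with aggregate uncertainty doubling as a simple out-of-distribution score. The 32x32 latent grid for a 256x256 frame, the bilinear upsampling, and the mean-variance OOD statistic are all assumptions for illustration; the paper's exact latent-to-pixel mapping may differ.

```python
# Sketch: latent-space uncertainty -> pixel-level RGB-space heatmap + OOD score.
import torch
import torch.nn.functional as F


def latent_uncertainty_to_heatmap(var: torch.Tensor, frame_hw=(256, 256)):
    """var: (B, C, h, w) per-element latent variance.
    Returns a (B, 1, H, W) heatmap normalized to [0, 1] per frame."""
    # Aggregate channel-wise variance into one score per latent location.
    score = var.mean(dim=1, keepdim=True)               # (B, 1, h, w)
    # Upsample to pixel resolution (e.g., 8x for a stride-8 video VAE).
    heat = F.interpolate(score, size=frame_hw, mode="bilinear",
                         align_corners=False)
    # Per-frame min-max normalization for visualization.
    lo = heat.amin(dim=(2, 3), keepdim=True)
    hi = heat.amax(dim=(2, 3), keepdim=True)
    return (heat - lo) / (hi - lo + 1e-8)


def ood_score(var: torch.Tensor) -> torch.Tensor:
    """A basic OOD statistic: mean predicted variance per sample. If the
    model is calibrated, inputs far from the training distribution should
    score higher; thresholding this gives a simple OOD detector."""
    return var.mean(dim=(1, 2, 3))


# Toy usage with an assumed 32x32 latent grid for a 256x256 frame.
var = torch.rand(2, 16, 32, 32)
heatmap = latent_uncertainty_to_heatmap(var)   # (2, 1, 256, 256)
scores = ood_score(var)                        # (2,)
```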