Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

January 9, 2026
Authors: Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun
cs.AI

Abstract

Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge: text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominos, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
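To make the goal-specification idea concrete, the sketch below (in PyTorch) shows one plausible way a force-vector goal could be encoded as a conditioning signal for a video generator. This is a minimal illustration under assumed conventions, not the paper's actual architecture: the class name ForceGoalEncoder and the 5-dimensional goal layout (application point, force vector, frame index) are hypothetical.

import torch
import torch.nn as nn

class ForceGoalEncoder(nn.Module):
    """Hypothetical encoder for a force goal: where the force is applied
    (normalized pixel coordinates), its direction and magnitude, and the
    frame at which it acts within the generated clip."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Input layout (assumed): (x, y) application point,
        # (fx, fy) force vector, t frame index -> 5 scalars total.
        self.mlp = nn.Sequential(
            nn.Linear(5, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, force_goal: torch.Tensor) -> torch.Tensor:
        # force_goal: (batch, 5) -> (batch, d_model) conditioning token.
        return self.mlp(force_goal)

# Example: a downward push at image center, applied at frame 4.
encoder = ForceGoalEncoder()
goal = torch.tensor([[0.5, 0.6, 0.0, -1.0, 4.0]])
cond = encoder(goal)
print(cond.shape)  # torch.Size([1, 256])

In a full system, a conditioning token like this would typically be injected into the video model alongside text or image conditions, for example via cross-attention; the paper's released code should be consulted for the actual mechanism.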