Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
January 9, 2026
Authors: Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun
cs.AI
Abstract
Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominoes, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
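The abstract does not describe how a force-vector goal is actually fed to the video model. One plausible reading is that a force applied at a given pixel and frame is rasterized into a per-frame 2D vector field and concatenated channel-wise with the model's video latents. The sketch below illustrates that idea only; the function name force_condition_map, the Gaussian splat, and all tensor shapes are hypothetical assumptions, not the paper's implementation.

```python
import numpy as np

def force_condition_map(T, H, W, frame, y, x, fx, fy, sigma=3.0):
    """Rasterize one goal force into a (T, H, W, 2) conditioning tensor.

    Channel 0 holds the x-component and channel 1 the y-component of the
    force. The force is splatted as a small Gaussian around its point of
    application so the conditioning signal is not a single hot pixel.
    All other frames and locations stay zero.
    """
    cond = np.zeros((T, H, W, 2), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]  # pixel coordinate grids, each (H, W)
    splat = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
    cond[frame, ..., 0] = fx * splat
    cond[frame, ..., 1] = fy * splat
    return cond

# Example: push an object rightward at frame 0, centered at pixel (32, 16).
cond = force_condition_map(T=16, H=64, W=64, frame=0, y=32, x=16, fx=1.0, fy=0.0)
print(cond.shape)  # (16, 64, 64, 2)
# In a conditioning setup of this kind, `cond` would be concatenated with
# the (noised) video latents before each denoising step, letting the model
# learn to propagate the specified force through time and space.
```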