WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
March 11, 2025
Authors: Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, Xiaodan Liang
cs.AI
Abstract
Recent rapid advancements in text-to-video (T2V) generation, such as Sora and
Kling, have shown great potential for building world simulators. However,
current T2V models struggle to grasp abstract physical principles and generate
videos that adhere to physical laws. This challenge arises primarily from the
significant gap between abstract physical principles and generative models, which
leaves the models without clear physical guidance. To this end, we introduce
the World Simulator Assistant (WISA), an effective framework for decomposing
and incorporating physical principles into T2V models. Specifically, WISA
decomposes physical principles into textual physical descriptions, qualitative
physical categories, and quantitative physical properties. To effectively embed
these physical attributes into the generation process, WISA incorporates
several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and
a Physical Classifier, enhancing the model's physics awareness. Furthermore,
most existing datasets feature videos where physical phenomena are either
weakly represented or entangled with multiple co-occurring processes, limiting
their suitability as dedicated resources for learning explicit physical
principles. We propose a novel video dataset, WISA-32K, collected based on
qualitative physical categories. It consists of 32,000 videos, representing 17
physical laws across three domains of physics: dynamics, thermodynamics, and
optics. Experimental results demonstrate that WISA can effectively enhance the
compatibility of T2V models with real-world physical laws, achieving a
considerable improvement on the VideoPhy benchmark. Visual demonstrations of
WISA and WISA-32K are available at https://360cvgroup.github.io/WISA/.
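
The three-part decomposition described in the abstract (textual physical descriptions, qualitative physical categories, quantitative physical properties) could be pictured as a single annotation record per video. The following is only a minimal sketch under stated assumptions: the class name `PhysicsAnnotation`, its field names, and the example values are hypothetical and are not taken from the WISA code or the WISA-32K dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhysicsAnnotation:
    """Hypothetical record for the three-part decomposition named in the abstract."""

    # Textual physical description (free-form sentence).
    description: str
    # Qualitative physical categories (labels; names here are assumptions).
    categories: List[str] = field(default_factory=list)
    # Quantitative physical properties (name -> value; keys here are assumptions).
    properties: Dict[str, float] = field(default_factory=dict)


# Illustrative values only; none of them come from WISA-32K.
example = PhysicsAnnotation(
    description="An ice cube melts on a warm metal plate.",
    categories=["thermodynamics"],
    properties={"temperature_celsius": 40.0},
)
print(example.categories)
```

How the actual framework encodes and injects these three signals (for example through MoPA and the Physical Classifier) is not specified in the abstract, so the sketch stops at the data layout.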