WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
March 11, 2025
Authors: Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, Xiaodan Liang
cs.AI
Abstract
Recent rapid advancements in text-to-video (T2V) generation, such as Sora and
Kling, have shown great potential for building world simulators. However,
current T2V models struggle to grasp abstract physical principles and generate
videos that adhere to physical laws. This challenge arises primarily from a
lack of clear guidance on physical information due to a significant gap between
abstract physical principles and generation models. To this end, we introduce
the World Simulator Assistant (WISA), an effective framework for decomposing
and incorporating physical principles into T2V models. Specifically, WISA
decomposes physical principles into textual physical descriptions, qualitative
physical categories, and quantitative physical properties. To effectively embed
these physical attributes into the generation process, WISA incorporates
several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and
a Physical Classifier, enhancing the model's physics awareness. Furthermore,
most existing datasets feature videos where physical phenomena are either
weakly represented or entangled with multiple co-occurring processes, limiting
their suitability as dedicated resources for learning explicit physical
principles. We propose a novel video dataset, WISA-32K, collected based on
qualitative physical categories. It consists of 32,000 videos, representing 17
physical laws across three domains of physics: dynamics, thermodynamics, and
optics. Experimental results demonstrate that WISA can effectively enhance the
compatibility of T2V models with real-world physical laws, achieving a
considerable improvement on the VideoPhy benchmark. Visual results for WISA and
WISA-32K are available at https://360cvgroup.github.io/WISA/.
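
The three-way decomposition described in the abstract (textual physical descriptions, qualitative physical categories, quantitative physical properties) can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class and field names, the example category names, the example property, and the multi-hot encoding are all hypothetical and are not taken from the WISA code or dataset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class PhysicsDomain(Enum):
    # The three domains named in the abstract.
    DYNAMICS = "dynamics"
    THERMODYNAMICS = "thermodynamics"
    OPTICS = "optics"


@dataclass
class PhysicalAnnotation:
    """Hypothetical container for the three kinds of physical information."""
    text_description: str          # textual physical description
    categories: List[str]          # qualitative physical categories (names are made up)
    domain: PhysicsDomain
    properties: Dict[str, float] = field(default_factory=dict)  # quantitative properties


def category_multi_hot(annotation: PhysicalAnnotation, vocabulary: List[str]) -> List[int]:
    """Encode the qualitative categories as a multi-hot vector (an assumed encoding)."""
    unknown = set(annotation.categories) - set(vocabulary)
    if unknown:
        raise ValueError(f"categories not in vocabulary: {sorted(unknown)}")
    return [1 if name in annotation.categories else 0 for name in vocabulary]


# Illustrative vocabulary; the real dataset covers 17 physical laws.
vocabulary = ["collision", "melting", "refraction"]
example = PhysicalAnnotation(
    text_description="An ice cube melts slowly on a warm metal plate.",
    categories=["melting"],
    domain=PhysicsDomain.THERMODYNAMICS,
    properties={"temperature_celsius": 40.0},
)
print(category_multi_hot(example, vocabulary))  # prints [0, 1, 0]
```

A multi-hot vector is one plausible way to hand qualitative categories to a classifier or a routing layer; how WISA actually consumes them is not specified in the abstract.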