Towards Physically Plausible Video Generation via VLM Planning

Abstract

Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing community attention to their potential as world simulators. However, despite these capabilities, VDMs often fail to produce physically plausible videos because they lack an inherent understanding of physics, which results in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our project page: https://madaoer.github.io/projects/physically_plausible_video_generation.
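The two-stage idea can be illustrated with a toy sketch. This is not the paper's implementation: the function names `coarse_trajectory` and `perturb`, the constant-gravity point-mass model, the time step, and the noise scale are all assumptions chosen for illustration. A closed-form projectile path stands in for the VLM planner's coarse output, and Gaussian perturbation stands in for the noise that leaves the video model room to refine the motion.

```python
import random


def coarse_trajectory(x0, y0, vx, vy, num_frames, dt=0.1, g=9.8):
    """Hypothetical stand-in for the stage-one planner.

    Returns one (x, y) point per frame for a point mass under constant
    gravity, with the y axis pointing up. A real planner would reason
    about the scene; here the physics is hard-coded.
    """
    points = []
    for k in range(num_frames):
        t = k * dt
        x = x0 + vx * t
        y = y0 + vy * t - 0.5 * g * t * t
        points.append((x, y))
    return points


def perturb(points, sigma, seed=0):
    """Hypothetical stand-in for the stage-two noise step.

    Adds zero-mean Gaussian noise to every point except the first.
    The first point is kept fixed on the assumption that it corresponds
    to the given input image.
    """
    rng = random.Random(seed)
    noisy = [points[0]]
    for x, y in points[1:]:
        noisy.append((x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)))
    return noisy


# A ball starting at height 10 with horizontal speed 2, planned over 5 frames.
plan = coarse_trajectory(0.0, 10.0, 2.0, 0.0, num_frames=5)
guidance = perturb(plan, sigma=0.05)
```

In this sketch, `plan` carries the physically motivated coarse path and `guidance` is the loosened version that a generator would be conditioned on. A larger `sigma` gives the generator more freedom at the cost of weaker adherence to the planned motion.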
