Compositional 3D-aware Video Generation with LLM Director
August 31, 2024
Authors: Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, Jiang Bian
cs.AI
Abstract
Significant progress has been made in text-to-video generation through the
use of powerful generative models and large-scale internet data. However,
substantial challenges remain in precisely controlling individual concepts
within the generated video, such as the motion and appearance of specific
characters and the movement of viewpoints. In this work, we propose a novel
paradigm that generates each concept in 3D representation separately and then
composes them with priors from Large Language Models (LLM) and 2D diffusion
models. Specifically, given an input textual prompt, our scheme consists of
three stages: 1) We leverage an LLM as the director to first decompose the complex
query into several sub-prompts that indicate individual concepts within the
video (e.g., scene, objects, motions), then let the LLM invoke
pre-trained expert models to obtain corresponding 3D representations of
concepts. 2) To compose these representations, we prompt a multi-modal LLM to
produce coarse guidance on the scales and coordinates of trajectories for the
objects. 3) To make the generated frames adhere to the natural image distribution,
we further leverage 2D diffusion priors and use Score Distillation Sampling to
refine the composition. Extensive experiments demonstrate that our method can
generate high-fidelity videos from text with diverse motion and flexible
control over each concept. Project page: https://aka.ms/c3v.
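
To make the second stage more concrete, the sketch below shows one way coarse guidance (a scale plus a few trajectory coordinates) could be expanded into a per-frame placement for an object. It is an illustration under stated assumptions, not the authors' implementation: the names `Guidance`, `interpolate`, and `compose` are hypothetical, the linear interpolation between waypoints is a simplification chosen for this sketch, and the guidance values are written by hand rather than produced by a multi-modal LLM. The Score Distillation Sampling refinement of stage 3 is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Guidance:
    """Hypothetical container for the coarse guidance described in stage 2."""
    scale: float            # coarse object scale relative to the scene
    waypoints: List[Vec3]   # coarse trajectory coordinates in scene space


def interpolate(waypoints: List[Vec3], num_frames: int) -> List[Vec3]:
    """Linearly interpolate coarse waypoints into one position per frame.

    Linear interpolation is an assumption of this sketch, not a detail
    taken from the paper.
    """
    if num_frames < 1 or not waypoints:
        raise ValueError("need at least one frame and one waypoint")
    if len(waypoints) == 1 or num_frames == 1:
        return [waypoints[0]] * num_frames

    segments = len(waypoints) - 1
    positions: List[Vec3] = []
    for frame in range(num_frames):
        # Map the frame index onto the polyline through the waypoints.
        u = frame / (num_frames - 1) * segments
        i = min(int(u), segments - 1)   # segment index
        t = u - i                       # position within that segment
        a, b = waypoints[i], waypoints[i + 1]
        positions.append(
            (
                a[0] + t * (b[0] - a[0]),
                a[1] + t * (b[1] - a[1]),
                a[2] + t * (b[2] - a[2]),
            )
        )
    return positions


def compose(guidance: Guidance, num_frames: int) -> List[Dict[str, object]]:
    """Attach the scale and a per-frame position to each frame index."""
    return [
        {"frame": frame, "scale": guidance.scale, "position": position}
        for frame, position in enumerate(
            interpolate(guidance.waypoints, num_frames)
        )
    ]


if __name__ == "__main__":
    # Hand-written guidance standing in for a multi-modal LLM's output.
    guidance = Guidance(scale=0.5, waypoints=[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
    for placement in compose(guidance, num_frames=3):
        print(placement)
```

In a full pipeline, placements like these would only serve as an initialization; the abstract states that the composition is subsequently refined with 2D diffusion priors.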