
VIMI: Grounding Video Generation through Multi-modal Instruction

July 8, 2024
Authors: Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov
cs.AI

Abstract

Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in multimodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts, and then utilize a two-stage training strategy to enable diverse video generation tasks within the same model. In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing a foundational model for grounded video generation. In the second stage, we finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions. This process further refines the model's ability to handle diverse inputs and tasks, ensuring seamless integration of multi-modal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, producing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous visually grounded video generation methods, VIMI can synthesize consistent and temporally coherent videos with large motion while retaining semantic control. Lastly, VIMI also achieves state-of-the-art text-to-video generation results on the UCF101 benchmark.
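
The retrieval-based pairing step described in the abstract can be illustrated with a minimal sketch: embed each text prompt and a bank of candidate images into a shared embedding space (e.g., CLIP-style), then attach the top-k most similar images to the prompt as its in-context examples. All names below and the cosine-similarity retrieval are illustrative assumptions, not VIMI's actual implementation.

```python
# Hypothetical sketch of retrieval-based multimodal prompt construction.
# Assumes prompt and image embeddings already live in a shared space
# (e.g., from a CLIP-style encoder); embeddings here are random stand-ins.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between (m, d) and (n, d) matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve_in_context_examples(
    prompt_emb: np.ndarray,   # (d,) embedding of one text prompt
    bank_embs: np.ndarray,    # (N, d) embeddings of candidate images
    k: int = 3,
) -> np.ndarray:
    """Return indices of the top-k images most similar to the prompt."""
    sims = cosine_sim(prompt_emb[None, :], bank_embs)[0]  # (N,)
    return np.argsort(-sims)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(1000, 512))   # stand-in image embeddings
    prompt = rng.normal(size=(512,))      # stand-in prompt embedding
    idx = retrieve_in_context_examples(prompt, bank, k=3)
    print("in-context example indices:", idx)
```

Under this reading of the abstract, each (text prompt, retrieved images) pair would serve as one multimodal prompt for stage-1 pretraining of the conditional video generation framework, and stage 2 would finetune that model on the three downstream tasks with task-specific multimodal instructions.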

