Compositional Video Generation as Flow Equalization
June 10, 2024
Authors: Xingyi Yang, Xinchao Wang
cs.AI
Abstract
Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated
unprecedented capability to transform natural language descriptions into
stunning and photorealistic videos. Despite the promising results, a
significant challenge remains: these models struggle to fully grasp complex
compositional interactions between multiple concepts and actions. This issue
arises when some words dominantly influence the final video, overshadowing
other concepts. To tackle this problem, we introduce Vico, a generic
framework for compositional video generation that explicitly ensures all
concepts are represented properly. At its core, Vico analyzes how input tokens
influence the generated video, and adjusts the model to prevent any single
concept from dominating. Specifically, Vico extracts attention weights from all
layers to build a spatial-temporal attention graph, and then estimates the
influence as the max-flow from the source text token to the video target
token. Although the direct computation of attention flow in diffusion models is
typically infeasible, we devise an efficient approximation based on subgraph
flows and employ a fast and vectorized implementation, which in turn makes the
flow computation manageable and differentiable. By updating the noisy latent to
balance these flows, Vico captures complex interactions and consequently
produces videos that closely adhere to textual descriptions. We apply our
method to multiple diffusion-based video models for compositional T2V and video
editing. Empirical results demonstrate that our framework significantly
enhances the compositional richness and accuracy of the generated videos. Visit
our website
at https://adamdad.github.io/vico/.
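As a reading aid, the abstract outlines two computational steps: estimating each text token's influence as a max-flow through the layered attention graph, and nudging the noisy latent to equalize those influences. The PyTorch sketch below illustrates that pipeline under stated assumptions; the min-over-layers bottleneck proxy for max-flow, the variance-based balancing loss, and the names and shapes involved (`token_influence`, `equalization_loss`, per-layer `[num_video_tokens, num_text_tokens]` cross-attention maps) are illustrative choices, not Vico's actual subgraph-flow algorithm or implementation.

```python
# Minimal sketch of the flow-equalization idea, assuming per-layer
# cross-attention maps of shape [num_video_tokens, num_text_tokens]
# (averaged over heads/frames). NOT the paper's actual algorithm.
import torch

def token_influence(attn_per_layer: list) -> torch.Tensor:
    """Crude proxy for max-flow from each text token to the video tokens.

    On any path through the layered attention graph the flow is capped by
    its bottleneck edge, so an elementwise min over layers gives a cheap,
    differentiable proxy (an assumption, not Vico's subgraph method).
    """
    flow = attn_per_layer[0]
    for attn in attn_per_layer[1:]:
        flow = torch.minimum(flow, attn)   # bottleneck across layers
    return flow.sum(dim=0)                 # total flow reaching each text token

def equalization_loss(influence: torch.Tensor) -> torch.Tensor:
    """Penalize imbalance so no single concept token dominates the video."""
    return influence.var()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in attention logits: 3 layers, 16 video tokens, 5 text tokens.
    logits = [torch.randn(16, 5, requires_grad=True) for _ in range(3)]
    attn_maps = [l.softmax(dim=-1) for l in logits]
    loss = equalization_loss(token_influence(attn_maps))
    loss.backward()  # in Vico-style use, this gradient would instead flow
                     # back to the noisy latent z_t, which is then updated
                     # (e.g., z_t <- z_t - step * grad) to balance the flows
    print(f"imbalance loss: {loss.item():.4f}")
```

In the method described by the abstract, the flow computation is vectorized over a much larger spatio-temporal graph and the update is applied to the noisy latent during sampling; this toy example only shows the general shape of such an objective.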