Compositional Video Generation as Flow Equalization
June 10, 2024
Authors: Xingyi Yang, Xinchao Wang
cs.AI
Abstract
Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated
unprecedented capability to transform natural language descriptions into
stunning and photorealistic videos. Despite the promising results, a
significant challenge remains: these models struggle to fully grasp complex
compositional interactions between multiple concepts and actions. This issue
arises when some words dominantly influence the final video, overshadowing
other concepts. To tackle this problem, we introduce Vico, a generic
framework for compositional video generation that explicitly ensures all
concepts are represented properly. At its core, Vico analyzes how input tokens
influence the generated video, and adjusts the model to prevent any single
concept from dominating. Specifically, Vico extracts attention weights from all
layers to build a spatial-temporal attention graph, and then estimates the
influence as the max-flow from the source text token to the video target
token. Although the direct computation of attention flow in diffusion models is
typically infeasible, we devise an efficient approximation based on subgraph
flows and employ a fast and vectorized implementation, which in turn makes the
flow computation manageable and differentiable. By updating the noisy latent to
balance these flows, Vico captures complex interactions and consequently
produces videos that closely adhere to textual descriptions. We apply our
method to multiple diffusion-based video models for compositional T2V and video
editing. Empirical results demonstrate that our framework significantly
enhances the compositional richness and accuracy of the generated videos. Visit
our website
at https://adamdad.github.io/vico/.
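To make the flow-equalization idea concrete, below is a minimal PyTorch sketch of a test-time latent update in the spirit the abstract describes. This is not the paper's implementation: the names `token_influence`, `equalize_latent`, and `attn_fn`, the min-over-layers bottleneck used as a proxy for the subgraph max-flow, and the variance loss are all illustrative assumptions.

```python
import torch

def token_influence(attn_maps, concept_ids):
    """Bottleneck proxy for per-token attention flow.

    attn_maps: list of [num_video_tokens, num_text_tokens] cross-attention
    matrices (head-averaged), one per layer. concept_ids indexes the text
    tokens whose influence we measure. All names are illustrative.
    """
    # Total attention mass each concept token sends to the video
    # tokens, per layer: shape [num_layers, num_concepts].
    per_layer = torch.stack([a[:, concept_ids].sum(dim=0) for a in attn_maps])
    # Flow through stacked layers is capped by the weakest layer:
    # a crude stand-in for the paper's subgraph max-flow estimate.
    return per_layer.min(dim=0).values

def equalize_latent(latent, attn_fn, concept_ids, lr=0.1):
    """One update nudging the noisy latent toward equal token flows.

    attn_fn(latent) must run a differentiable forward pass of the video
    denoiser and return its cross-attention maps (hypothetical hook).
    """
    latent = latent.detach().requires_grad_(True)
    flows = token_influence(attn_fn(latent), concept_ids)
    loss = flows.var()  # variance penalizes any single concept dominating
    loss.backward()
    # Gradient descent on the latent itself, not on the model weights.
    return (latent - lr * latent.grad).detach()
```

In this reading, the update is applied at each denoising step before the regular sampler update, so that no concept token's influence on the video tokens collapses as sampling proceeds.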