Compositional Video Generation as Flow Equalization
June 10, 2024
Authors: Xingyi Yang, Xinchao Wang
cs.AI
Abstract
Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated
unprecedented capability to transform natural language descriptions into
stunning and photorealistic videos. Despite the promising results, a
significant challenge remains: these models struggle to fully grasp complex
compositional interactions between multiple concepts and actions. This issue
arises when some words dominantly influence the final video, overshadowing
other concepts. To tackle this problem, we introduce Vico, a generic
framework for compositional video generation that explicitly ensures all
concepts are represented properly. At its core, Vico analyzes how input tokens
influence the generated video, and adjusts the model to prevent any single
concept from dominating. Specifically, Vico extracts attention weights from all
layers to build a spatial-temporal attention graph, and then estimates the
influence as the max-flow from the source text token to the video target
token. Although the direct computation of attention flow in diffusion models is
typically infeasible, we devise an efficient approximation based on subgraph
flows and employ a fast and vectorized implementation, which in turn makes the
flow computation manageable and differentiable. By updating the noisy latent to
balance these flows, Vico captures complex interactions and consequently
produces videos that closely adhere to textual descriptions. We apply our
method to multiple diffusion-based video models for compositional T2V and video
editing. Empirical results demonstrate that our framework significantly
enhances the compositional richness and accuracy of the generated videos. Visit
our website
at https://adamdad.github.io/vico/.
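To make the flow-equalization idea concrete, below is a minimal PyTorch sketch of a test-time latent update in the spirit the abstract describes. This is not the paper's implementation: the names `token_influence`, `equalize_latent`, and `attn_fn`, the min-over-layers bottleneck used as a proxy for the subgraph max-flow, and the variance loss are all illustrative assumptions.

```python
import torch

def token_influence(attn_maps, concept_ids):
    """Bottleneck proxy for per-token attention flow.

    attn_maps: list of [num_video_tokens, num_text_tokens] cross-attention
    matrices (head-averaged), one per layer. concept_ids indexes the text
    tokens whose influence we measure. All names are illustrative.
    """
    # Total attention mass each concept token sends to the video
    # tokens, per layer: shape [num_layers, num_concepts].
    per_layer = torch.stack([a[:, concept_ids].sum(dim=0) for a in attn_maps])
    # Flow through stacked layers is capped by the weakest layer:
    # a crude stand-in for the paper's subgraph max-flow estimate.
    return per_layer.min(dim=0).values

def equalize_latent(latent, attn_fn, concept_ids, lr=0.1):
    """One update nudging the noisy latent toward equal token flows.

    attn_fn(latent) must run a differentiable forward pass of the video
    denoiser and return its cross-attention maps (hypothetical hook).
    """
    latent = latent.detach().requires_grad_(True)
    flows = token_influence(attn_fn(latent), concept_ids)
    loss = flows.var()  # variance penalizes any single concept dominating
    loss.backward()
    # Gradient descent on the latent itself, not on the model weights.
    return (latent - lr * latent.grad).detach()
```

In this reading, the update is applied at each denoising step before the regular sampler update, so that no concept token's influence on the video tokens collapses as sampling proceeds.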