MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

Abstract

Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle to bind attributes accurately, determine spatial relationships, and capture complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) in the conditioning stage, we introduce Semantic Anchor Disambiguation, which reinforces subject-specific semantics and resolves inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) in the denoising stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach that can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
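The two refinement stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the linear ramp schedule, the interpolation toward an anchor embedding, the thresholded union used to fuse layouts, and the additive attention bias are all assumptions chosen for illustration, as are the function and parameter names.

```python
import numpy as np


def inject_anchor_direction(token_emb, anchor_emb, step, total_steps, max_strength=0.5):
    """Shift a subject's token embedding toward its semantic anchor.

    Assumption: the "directional vector" is (anchor_emb - token_emb) and the
    injection strength ramps up linearly over the steps.
    """
    strength = max_strength * (step + 1) / total_steps
    return token_emb + strength * (anchor_emb - token_emb)


def fuse_layout(prior_mask, attn_map, threshold=0.5):
    """Fuse a grounding prior with the model's own spatial attention.

    Assumption: fusion is the union of the prior boolean mask and the
    attention map thresholded relative to its peak value.
    """
    peak = attn_map.max()
    if peak > 0:
        adaptive = (attn_map / peak) >= threshold
    else:
        adaptive = np.zeros_like(prior_mask, dtype=bool)
    return prior_mask | adaptive


def masked_cross_attention(queries, keys, values, region_mask, bias=4.0):
    """Cross-attention with an additive bias on masked (location, token) pairs.

    queries: (N, d) spatial tokens; keys: (T, d) and values: (T, dv) text tokens.
    region_mask: (N, T) boolean, True where text token t should be bound
    to spatial location n. The additive bias is an assumed form of modulation.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)
    logits = logits + bias * region_mask.astype(logits.dtype)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights
```

In this sketch, `inject_anchor_direction` would be applied to each subject token before denoising, while `fuse_layout` and `masked_cross_attention` would be applied inside the cross-attention layers at each denoising step. Neither function depends on a particular T2V backbone, which matches the model-agnostic framing of the abstract.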
