MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation
March 18, 2025
Authors: Hongyu Zhang, Yufan Deng, Shenghai Yuan, Peng Jin, Zesen Cheng, Yian Zhao, Chang Liu, Jie Chen
cs.AI
Abstract
Text-to-video (T2V) generation has made significant strides with diffusion
models. However, existing methods still struggle with accurately binding
attributes, determining spatial relationships, and capturing complex action
interactions between multiple subjects. To address these limitations, we
propose MagicComp, a training-free method that enhances compositional T2V
generation through dual-phase refinement. Specifically, (1) During the
Conditioning Stage: We introduce Semantic Anchor Disambiguation, which
reinforces subject-specific semantics and resolves inter-subject ambiguity by
progressively injecting the directional vectors of semantic anchors into the
original text embeddings; (2) During the Denoising Stage: We propose Dynamic
Layout Fusion Attention, which integrates grounding priors and model-adaptive
spatial perception to flexibly bind subjects to their spatiotemporal regions
through masked attention modulation. Furthermore, MagicComp is a model-agnostic
and versatile approach, which can be seamlessly integrated into existing T2V
architectures. Extensive experiments on T2V-CompBench and VBench demonstrate
that MagicComp outperforms state-of-the-art methods, highlighting its potential
for applications such as complex prompt-based and trajectory-controllable video
generation. Project page: https://hong-yu-zhang.github.io/MagicComp-Page/.
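The two refinement phases described in the abstract can be sketched roughly as follows. This is an illustrative toy implementation, not the authors' code: the function names, the blending weight `alpha`, and the use of a hard boolean region mask are all assumptions made for demonstration. The first function mimics injecting a semantic anchor's directional vector into a text embedding; the second mimics masked attention modulation that restricts a subject's attention to its spatial region.

```python
import numpy as np

def inject_semantic_anchor(text_emb, anchor_emb, alpha=0.3):
    """Toy sketch of semantic anchor disambiguation: nudge a token
    embedding along the normalized direction toward a subject-specific
    semantic anchor. `alpha` (assumed) controls the injection strength."""
    direction = anchor_emb - text_emb
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction = direction / norm
    return text_emb + alpha * direction

def masked_attention(q, k, v, region_mask, neg=-1e9):
    """Toy sketch of masked attention modulation: attention logits for
    key positions outside a subject's spatiotemporal region (mask=False)
    are suppressed before the softmax, binding the subject to its region."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)           # (num_queries, num_keys)
    logits = np.where(region_mask, logits, neg)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In the actual method, such masks would come from grounding priors fused with the model's own spatial perception, and the anchor injection would happen progressively during conditioning; the snippet only illustrates the two core operations in isolation.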