CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
February 10, 2025
Authors: D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu
cs.AI
Abstract
Customized generation has achieved significant progress in image synthesis,
yet personalized video generation remains challenging due to temporal
inconsistencies and quality degradation. In this paper, we introduce
CustomVideoX, an innovative framework leveraging the video diffusion
transformer for personalized video generation from a reference image.
CustomVideoX capitalizes on pre-trained video networks by exclusively training
the LoRA parameters to extract reference features, ensuring both efficiency and
adaptability. To facilitate seamless interaction between the reference image
and video content, we propose 3D Reference Attention, which enables direct and
simultaneous engagement of reference image features with all video frames
across spatial and temporal dimensions. To mitigate the excessive influence of
reference image features and textual guidance on generated video content during
inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy,
dynamically modulating reference bias over different time steps. Additionally,
we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly
activated regions of key entity tokens with reference feature injection by
adjusting attention bias. To thoroughly evaluate personalized video generation,
we establish a new benchmark, VideoBench, comprising over 50 objects and 100
prompts for extensive assessment. Experimental results show that CustomVideoX
significantly outperforms existing methods in terms of video consistency and
quality.
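The abstract describes reference features attending jointly with all video frames, with a bias on the reference attention that changes over denoising time steps. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `reference_attention` and `time_aware_bias`, the use of identity query/key/value projections, and in particular the parabolic bias schedule, which the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_attention(video_tokens, ref_tokens, ref_bias=0.0):
    """Joint attention over all video tokens plus reference tokens.

    video_tokens: (frames, tokens_per_frame, dim)
    ref_tokens:   (ref_len, dim)
    ref_bias:     scalar added to the logits of reference keys only
    (Hypothetical helper; learned projections are omitted for brevity.)
    """
    f, n, d = video_tokens.shape
    q = video_tokens.reshape(f * n, d)             # flatten space and time
    kv = np.concatenate([q, ref_tokens], axis=0)   # video keys + reference keys
    logits = q @ kv.T / np.sqrt(d)
    logits[:, f * n:] += ref_bias                  # bias the reference columns
    weights = softmax(logits, axis=-1)
    out = weights @ kv
    return out.reshape(f, n, d), weights


def time_aware_bias(t, num_steps, max_bias=0.0, min_bias=-4.0):
    """Assumed parabolic schedule: reference influence peaks mid-sampling
    and is suppressed at the first and last steps. Purely illustrative."""
    s = t / (num_steps - 1)
    return min_bias + (max_bias - min_bias) * 4.0 * s * (1.0 - s)


rng = np.random.default_rng(0)
video = rng.normal(size=(4, 6, 8))   # 4 frames, 6 tokens per frame, dim 8
ref = rng.normal(size=(5, 8))        # 5 reference-image tokens

for t in (0, 25, 50):
    bias = time_aware_bias(t, num_steps=51)
    _, w = reference_attention(video, ref, ref_bias=bias)
    ref_mass = w[:, 4 * 6:].sum(axis=-1).mean()
    print(f"step {t:2d}  bias {bias:+.2f}  mean attention on reference {ref_mass:.3f}")
```

Because every video token attends to the same reference keys regardless of its frame, the reference interacts with all frames at once. A more negative bias lowers the share of attention mass that lands on the reference tokens, which is one simple way a time-dependent bias could limit the reference's influence at selected steps.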