
DreamRelation: Relation-Centric Video Customization

March 10, 2025
Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
cs.AI

Abstract

Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of the relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce a space-time relational contrastive loss, which prioritizes relational dynamics while minimizing reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
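
The abstract does not give the form of the space-time relational contrastive loss. As a rough illustration of the general idea behind contrastive objectives, the sketch below implements a generic InfoNCE-style loss in plain Python. It is not the paper's formulation: the function names, the `temperature` value, and the roles of `anchor`, `positives`, and `negatives` are all assumptions made for this example, and which features play those roles in DreamRelation is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchor, positives, negatives, temperature=0.07):
    """Generic InfoNCE-style loss (hypothetical helper, not the paper's loss).

    Returns -log(sum_pos exp(s/t) / sum_all exp(s/t)), which is small when
    the anchor is more similar to the positives than to the negatives.
    """
    pos_logits = [cosine_similarity(anchor, p) / temperature for p in positives]
    neg_logits = [cosine_similarity(anchor, n) / temperature for n in negatives]
    all_logits = pos_logits + neg_logits

    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(all_logits)
    log_numerator = m + math.log(sum(math.exp(l - m) for l in pos_logits))
    log_denominator = m + math.log(sum(math.exp(l - m) for l in all_logits))
    return log_denominator - log_numerator


# Toy 2-D features, invented purely for illustration.
anchor = [1.0, 0.0]
aligned = info_nce_loss(anchor, positives=[[0.9, 0.1]], negatives=[[0.0, 1.0]])
misaligned = info_nce_loss(anchor, positives=[[0.0, 1.0]], negatives=[[0.9, 0.1]])
print(aligned < misaligned)
```

In this toy setup the loss is lower when the positive feature points in nearly the same direction as the anchor, which is the behavior a contrastive objective relies on to pull the desired features together and push unwanted ones apart.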
