Phantom: Subject-consistent video generation via cross-modal alignment

February 16, 2025
Authors: Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
cs.AI

Abstract

As foundation models for video generation continue to develop, they are being applied in a growing range of areas, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: extracting subject elements from reference images and generating a subject-consistent video guided by textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image so that text and visual content are aligned deeply and simultaneously. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment from text-image-video triplet data. In particular, we emphasize subject consistency in human generation, which covers existing ID-preserving video generation while offering additional advantages. Project homepage: https://phantom-video.github.io/Phantom/
