

PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

June 9, 2025
Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
cs.AI

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
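The most concrete mechanism named in the abstract is the 3D-RoPE-based enhancement module. Below is a minimal sketch of 3D rotary position embedding applied to a joint text-image token sequence, assuming the standard RoPE construction with the channel dimension split across (t, h, w) axes. The dimension split, the zero coordinate assigned to text tokens, and all function names here are illustrative assumptions, not the paper's implementation.

```python
# Minimal 3D-RoPE sketch: rotate token channels by per-axis angles derived
# from (t, h, w) coordinates. Assumptions (not from the paper): channels split
# into three equal chunks, one per axis; text tokens share a placeholder
# coordinate of (0, 0, 0).
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: rotate channel pairs of x by angles pos * theta_i."""
    d = x.shape[-1]  # must be even
    theta = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (d/2,)
    ang = pos[..., None] * theta                                 # (n, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Apply RoPE per axis: channel chunks 0/1/2 are rotated by t/h/w."""
    d = x.shape[-1]
    assert d % 6 == 0, "dim must split into three even chunks"
    c = d // 3
    chunks = [
        rope_1d(x[..., i * c:(i + 1) * c], coords[..., i].to(x.dtype))
        for i in range(3)
    ]
    return torch.cat(chunks, dim=-1)

# Toy usage: a 2x4x4 video latent grid plus 8 text tokens sharing one anchor
# coordinate (an assumption; the paper does not specify text coordinates).
d = 48
img = torch.randn(2 * 4 * 4, d)
t, h, w = torch.meshgrid(torch.arange(2), torch.arange(4), torch.arange(4), indexing="ij")
img_coords = torch.stack([t, h, w], dim=-1).reshape(-1, 3)
txt = torch.randn(8, d)
txt_coords = torch.zeros(8, 3, dtype=torch.long)
tokens = rope_3d(torch.cat([img, txt], 0), torch.cat([img_coords, txt_coords], 0))
# `tokens` would then pass through joint self-attention for fusion.
```

Rotating text and image tokens in a shared positional space before full self-attention is one plausible way to realize the "structured bidirectional fusion" the abstract describes.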
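The data pipeline's clique-based subject consolidation is described only at a high level. The sketch below shows one plausible reading, where detections whose appearance embeddings are mutually similar form cliques in a similarity graph and each clique is merged into a single subject identity. The threshold, the cosine-similarity choice, and the consolidate_subjects helper are hypothetical.

```python
# Hypothetical clique-based subject consolidation: group detected subject
# crops whose embeddings are mutually similar, so duplicate detections of the
# same subject collapse to one identity. Threshold and similarity metric are
# placeholder assumptions, not the paper's actual pipeline.
import networkx as nx
import numpy as np

def consolidate_subjects(features: np.ndarray, thresh: float = 0.8):
    """features: (n, d) L2-normalized appearance embeddings, one per detection.
    Returns index groups; each maximal clique of the similarity graph is
    treated as one consolidated subject identity."""
    n = len(features)
    sim = features @ features.T  # cosine similarity for normalized vectors
    g = nx.Graph()
    g.add_nodes_from(range(n))
    g.add_edges_from(
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i, j] >= thresh
    )
    # Take maximal cliques greedily, largest first, without reusing detections.
    used, groups = set(), []
    for clique in sorted(nx.find_cliques(g), key=len, reverse=True):
        keep = [i for i in clique if i not in used]
        if keep:
            groups.append(keep)
            used.update(keep)
    return groups
```

Requiring mutual similarity (a clique) rather than transitive chains is a natural way to enhance subject distinction, since two visually dissimilar detections are never merged just because they both resemble a third.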