PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization that keeps identities and interactions consistent. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible, identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module that injects the fused identity features into the video generation process and mitigates identity drift. Finally, we construct a data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, which improves subject distinction and reduces ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
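
The abstract names a clique-based subject consolidation strategy without giving its details. As a rough illustration of the general idea only, and not the authors' pipeline, the sketch below assumes each detected subject crop comes with a feature vector, links crops whose cosine similarity passes a threshold, and groups them by maximal cliques so that every member of a group resembles every other member. The function names, the threshold value, and the greedy largest-clique-first assignment are all hypothetical choices made for this example.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def build_graph(features, threshold):
    """Connect detections i and j when their features are similar enough."""
    adj = {i: set() for i in range(len(features))}
    for i, j in combinations(range(len(features)), 2):
        if cosine(features[i], features[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return [c for c in cliques if c]


def consolidate(features, threshold=0.9):
    """Greedily assign detections to subjects, largest clique first (assumed policy)."""
    adj = build_graph(features, threshold)
    ordered = sorted(maximal_cliques(adj), key=lambda c: (-len(c), min(c)))
    assigned, groups = set(), []
    for clique in ordered:
        free = clique - assigned  # any subset of a clique is still a clique
        if free:
            groups.append(sorted(free))
            assigned |= free
    return groups


if __name__ == "__main__":
    # Toy 2-D features: detections 0, 1, 4 look alike, as do detections 2, 3.
    feats = [(1, 0), (0.99, 0.05), (0, 1), (0.02, 1), (1, 0.01)]
    print(consolidate(feats, threshold=0.9))
```

Requiring a full clique rather than a connected component is the stricter choice: a chain of pairwise-similar crops that drifts from one subject to another would not be merged into a single identity.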