

PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

June 9, 2025
作者: Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
cs.AI

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
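The enhancement module described above builds on 3D rotary position embeddings (3D-RoPE), which extend standard RoPE by splitting each attention head's channel dimension into three chunks and rotating each chunk by one coordinate axis (time, height, width). The paper does not give implementation details, so the following is a minimal sketch of the generic axis-wise RoPE technique, not the authors' exact module; all function names and the even-chunk dimension split are illustrative assumptions.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding over the last dim (must be even).
    x: (n, d) token features, pos: (n,) integer positions."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    ang = pos.float()[:, None] * inv_freq[None, :]        # (n, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # paired channels
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                    # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    """Axis-wise 3D-RoPE: split the channel dim into three even chunks and
    rotate each by one coordinate (t, h, w).
    x: (n, d) with d divisible by 6; t, h, w: (n,) per-token coordinates.
    Text tokens can be handled by assigning them positions on one shared axis."""
    d = x.shape[-1]
    c = d // 3
    return torch.cat([rope_1d(x[:, :c], t),
                      rope_1d(x[:, c:2 * c], h),
                      rope_1d(x[:, 2 * c:], w)], dim=-1)
```

Because each pair of channels undergoes a pure rotation, the transform preserves token norms while making attention scores depend on relative (t, h, w) offsets, which is what lets text and image embeddings be fused in one structured positional frame.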
PDF · June 10, 2025