PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.