BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
October 1, 2025
Authors: Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, Zehuan Yuan
cs.AI
Abstract
Diffusion Transformers have shown remarkable abilities in generating
high-fidelity videos, delivering visually coherent frames and rich details over
extended durations. However, existing video generation models still fall short
in subject-consistent video generation due to an inherent difficulty in parsing
prompts that specify complex spatial relationships, temporal logic, and
interactions among multiple subjects. To address this issue, we propose
BindWeave, a unified framework that handles a broad range of subject-to-video
scenarios from single-subject cases to complex multi-subject scenes with
heterogeneous entities. To bind complex prompt semantics to concrete visual
subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal
large language model performs deep cross-modal reasoning to ground entities and
disentangle roles, attributes, and interactions, yielding subject-aware hidden
states that condition the diffusion transformer for high-fidelity
subject-consistent video generation. Experiments on the OpenS2V benchmark
demonstrate that our method achieves superior performance across subject
consistency, naturalness, and text relevance in generated videos, outperforming
existing open-source and commercial models.
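To make the conditioning mechanism described in the abstract concrete, below is a minimal PyTorch sketch of how subject-aware hidden states produced by a pretrained multimodal LLM might be projected and injected into a DiT-style block through cross-attention. The class names, tensor shapes, and the single linear projection are illustrative assumptions for exposition only, not the paper's actual implementation.

import torch
import torch.nn as nn

class SubjectAwareConditioner(nn.Module):
    """Hypothetical bridge: project MLLM hidden states into the DiT's
    conditioning space (dimensions are assumed, not from the paper)."""
    def __init__(self, mllm_dim=4096, cond_dim=1024):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden_states):
        # mllm_hidden_states: [batch, seq_len, mllm_dim], produced by a
        # pretrained multimodal LLM reading the prompt and reference images.
        return self.proj(mllm_hidden_states)

class CrossAttentionBlock(nn.Module):
    """Minimal DiT-style block: video latent tokens attend to the
    subject-aware conditioning tokens."""
    def __init__(self, latent_dim=1024, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, video_tokens, cond_tokens):
        q = self.norm(video_tokens)
        out, _ = self.attn(q, cond_tokens, cond_tokens)
        return video_tokens + out  # residual update conditioned on subjects

# Toy usage with random tensors standing in for real MLLM outputs and
# flattened spatio-temporal video latents (shapes are illustrative only).
mllm_states = torch.randn(2, 77, 4096)
video_latents = torch.randn(2, 256, 1024)
cond = SubjectAwareConditioner()(mllm_states)
video_latents = CrossAttentionBlock()(video_latents, cond)
print(video_latents.shape)  # torch.Size([2, 256, 1024])

The sketch only illustrates the general pattern of conditioning a diffusion transformer on language-model hidden states; how BindWeave grounds entities and disentangles roles, attributes, and interactions is described in the paper itself.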