BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
October 1, 2025
Authors: Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, Zehuan Yuan
cs.AI
Abstract
Diffusion Transformers have shown remarkable abilities in generating
high-fidelity videos, delivering visually coherent frames and rich details over
extended durations. However, existing video generation models still fall short
in subject-consistent video generation due to an inherent difficulty in parsing
prompts that specify complex spatial relationships, temporal logic, and
interactions among multiple subjects. To address this issue, we propose
BindWeave, a unified framework that handles a broad range of subject-to-video
scenarios from single-subject cases to complex multi-subject scenes with
heterogeneous entities. To bind complex prompt semantics to concrete visual
subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal
large language model performs deep cross-modal reasoning to ground entities and
disentangle roles, attributes, and interactions, yielding subject-aware hidden
states that condition the diffusion transformer for high-fidelity
subject-consistent video generation. Experiments on the OpenS2V benchmark
demonstrate that our method achieves superior performance across subject
consistency, naturalness, and text relevance in generated videos, outperforming
existing open-source and commercial models.
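To make the conditioning mechanism described in the abstract concrete, below is a minimal PyTorch sketch of how subject-aware hidden states produced by a pretrained multimodal LLM might be projected and injected into a DiT-style block through cross-attention. The class names, tensor shapes, and the single linear projection are illustrative assumptions for exposition only, not the paper's actual implementation.

import torch
import torch.nn as nn

class SubjectAwareConditioner(nn.Module):
    """Hypothetical bridge: project MLLM hidden states into the DiT's
    conditioning space (dimensions are assumed, not from the paper)."""
    def __init__(self, mllm_dim=4096, cond_dim=1024):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden_states):
        # mllm_hidden_states: [batch, seq_len, mllm_dim], produced by a
        # pretrained multimodal LLM reading the prompt and reference images.
        return self.proj(mllm_hidden_states)

class CrossAttentionBlock(nn.Module):
    """Minimal DiT-style block: video latent tokens attend to the
    subject-aware conditioning tokens."""
    def __init__(self, latent_dim=1024, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, video_tokens, cond_tokens):
        q = self.norm(video_tokens)
        out, _ = self.attn(q, cond_tokens, cond_tokens)
        return video_tokens + out  # residual update conditioned on subjects

# Toy usage with random tensors standing in for real MLLM outputs and
# flattened spatio-temporal video latents (shapes are illustrative only).
mllm_states = torch.randn(2, 77, 4096)
video_latents = torch.randn(2, 256, 1024)
cond = SubjectAwareConditioner()(mllm_states)
video_latents = CrossAttentionBlock()(video_latents, cond)
print(video_latents.shape)  # torch.Size([2, 256, 1024])

The sketch only illustrates the general pattern of conditioning a diffusion transformer on language-model hidden states; how BindWeave grounds entities and disentangles roles, attributes, and interactions is described in the paper itself.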