

BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

October 1, 2025
Authors: Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, Zehuan Yuan
cs.AI

Abstract

Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
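
To make the conditioning pathway concrete, here is a minimal sketch (not the authors' released code) of how subject-aware hidden states from a multimodal LLM could condition a diffusion transformer block via cross-attention, as the abstract describes. All module names, dimensions, and the cross-attention placement are illustrative assumptions.

```python
# Conceptual sketch: a DiT block conditioned on MLLM hidden states.
# Shapes, layer names, and widths are assumptions, not BindWeave's actual code.
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """One DiT block with cross-attention to subject-aware MLLM states."""
    def __init__(self, dim: int = 1024, mllm_dim: int = 4096, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Project the MLLM's hidden states into the DiT's token width.
        self.cond_proj = nn.Linear(mllm_dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor,
                mllm_hidden: torch.Tensor) -> torch.Tensor:
        # Self-attention over spatio-temporal video tokens.
        x = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(x, x, x)[0]
        # Cross-attention: video tokens attend to the subject-aware states,
        # binding prompt semantics to the referenced visual subjects.
        cond = self.cond_proj(mllm_hidden)
        x = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(x, cond, cond)[0]
        return video_tokens + self.mlp(self.norm3(video_tokens))

# Toy usage: batch of 2, 256 video tokens, 77 MLLM conditioning tokens.
block = ConditionedDiTBlock()
out = block(torch.randn(2, 256, 1024), torch.randn(2, 77, 4096))
print(out.shape)  # torch.Size([2, 256, 1024])
```

The design choice sketched here is that the MLLM, rather than a plain text encoder, supplies the conditioning signal, so the states attended to already encode grounded entities and their roles, attributes, and interactions.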