Bernini: Latent Semantic Planning for Video Diffusion

Abstract

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
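
To make the division of labor concrete, the following is a minimal sketch of the planner-to-renderer interface described above: the planner emits a semantic plan, and the renderer is conditioned on that plan, on text features, and (for editing only) on source VAE features. It is an illustration under stated assumptions, not the Bernini implementation. All names (`RenderConditions`, `toy_planner`, `toy_text_encoder`, `build_conditions`, `describe`), the feature dimensions, and the placeholder values are hypothetical; the real planner is an MLLM and the real renderer is a DiT, neither of which is modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional

Features = List[List[float]]  # a sequence of feature vectors


@dataclass
class RenderConditions:
    """Everything the renderer is conditioned on (hypothetical container)."""
    semantic_plan: Features                  # planner output in a ViT-like embedding space
    text_features: Features                  # encoded instruction or prompt
    source_vae_features: Optional[Features]  # only present for editing


def toy_planner(prompt: str, num_tokens: int = 4, dim: int = 8) -> Features:
    """Stand-in for the MLLM-based planner: emits placeholder target embeddings."""
    seed = len(prompt)
    return [[float((seed + i + j) % 5) for j in range(dim)] for i in range(num_tokens)]


def toy_text_encoder(prompt: str, dim: int = 8) -> Features:
    """Stand-in text encoder: one placeholder vector per whitespace-separated word."""
    return [[float(len(word))] * dim for word in prompt.split()]


def build_conditions(prompt: str, source_vae: Optional[Features] = None) -> RenderConditions:
    """Assemble renderer inputs; source_vae is supplied only when editing."""
    return RenderConditions(
        semantic_plan=toy_planner(prompt),
        text_features=toy_text_encoder(prompt),
        source_vae_features=source_vae,
    )


def describe(cond: RenderConditions) -> str:
    """Report which conditioning signals a renderer would receive."""
    mode = "editing" if cond.source_vae_features is not None else "generation"
    return f"{mode}: plan={len(cond.semantic_plan)} tokens, text={len(cond.text_features)} tokens"


# Generation uses the plan and text only; editing additionally passes source VAE features.
generation = build_conditions("a marble statue turning its head")
editing = build_conditions("make the statue bronze", source_vae=[[0.0] * 16] * 3)
print(describe(generation))
print(describe(editing))
```

Because the semantic plan is the only thing passed from planner to renderer, either stand-in could be swapped out without touching the other, which is the property that lets the two components be trained separately.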