Bernini: Video Diffusion via Latent Semantic Planning

Abstract

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to transfer understanding into generation more effectively. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
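
The division of labor described above can be illustrated with a minimal dataflow sketch. It is not the Bernini implementation: `toy_planner`, `assemble_conditions`, `RendererInputs`, and all token counts and dimensions are hypothetical names and values chosen for illustration. The sketch only shows which signals would reach the renderer under the stated design: a semantic plan in ViT embedding space, text features, and, for editing only, source VAE features.

```python
from dataclasses import dataclass
from typing import List, Optional

Tokens = List[List[float]]  # a sequence of feature vectors


@dataclass
class RendererInputs:
    """Conditioning signals handed to the (hypothetical) DiT renderer."""
    semantic_plan: Tokens                # planner output, in ViT embedding space
    text_features: Tokens                # text conditioning
    source_vae: Optional[Tokens] = None  # source-video latents, editing only


def toy_planner(prompt: str, num_tokens: int = 4, dim: int = 8) -> Tokens:
    """Stand-in for the MLLM planner: returns deterministic placeholder tokens.

    A real planner would predict target ViT embeddings; here the values are
    arbitrary and only the shape (num_tokens x dim) is meaningful.
    """
    seed = sum(ord(c) for c in prompt)
    return [
        [((seed + i * dim + j) % 17) / 17.0 for j in range(dim)]
        for i in range(num_tokens)
    ]


def assemble_conditions(
    prompt: str,
    text_features: Tokens,
    source_vae: Optional[Tokens] = None,
    editing: bool = False,
) -> RendererInputs:
    """Collect renderer inputs for generation (no source) or editing (with source)."""
    if editing and source_vae is None:
        raise ValueError("editing requires source VAE features")
    plan = toy_planner(prompt)
    # For pure generation, source VAE features are not passed to the renderer.
    return RendererInputs(plan, text_features, source_vae if editing else None)


if __name__ == "__main__":
    gen = assemble_conditions("a cat on a beach", [[0.0] * 8])
    edit = assemble_conditions(
        "make the car red", [[0.0] * 8], source_vae=[[1.0] * 4], editing=True
    )
    print(len(gen.semantic_plan), gen.source_vae is None, edit.source_vae is not None)
```

Because the only interface between the two stages in this sketch is the `semantic_plan` field, either side could be swapped or trained on its own, which mirrors the abstract's argument for separate training with light co-training.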