Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Abstract

Audio-video generation has often relied on complex multi-stage architectures or on sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. Using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower whose architecture is identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers on a vast video corpus, with blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention). The resulting model produces movie-grade video clips with natural speech and accurate, context-matched sound effects, enabling cinematic storytelling. Demos, code, and model weights are available at https://aaxwaz.github.io/Ovi
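
The blockwise exchange of timing and semantics between the two towers can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a single attention head and omits self-attention, feed-forward layers, normalization, and text conditioning. The function names (`rope`, `cross_attend`, `fusion_block`, `init_params`), the toy token counts, and the choice to scale audio positions by the ratio of video to audio token counts are assumptions made for illustration only.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (tokens, dim) by angles proportional to positions.

    Positions may be fractional, which is what lets one modality's
    timeline be rescaled onto the other's. dim must be even.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (tokens, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(q_tokens, kv_tokens, q_pos, kv_pos, wq, wk, wv):
    """Single-head cross-attention: q_tokens read from kv_tokens."""
    q = rope(q_tokens @ wq, q_pos)
    k = rope(kv_tokens @ wk, kv_pos)
    v = kv_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def fusion_block(video, audio, params):
    """One bidirectional exchange between the video and audio streams."""
    n_v, n_a = len(video), len(audio)
    video_pos = np.arange(n_v, dtype=float)
    # Assumption: audio positions are scaled so both streams span the
    # same temporal range despite having different token counts.
    audio_pos = np.arange(n_a, dtype=float) * (n_v / n_a)
    # Both directions read the block's inputs, so the exchange is symmetric.
    video_out = video + cross_attend(video, audio, video_pos, audio_pos,
                                     *params["video_from_audio"])
    audio_out = audio + cross_attend(audio, video, audio_pos, video_pos,
                                     *params["audio_from_video"])
    return video_out, audio_out

def init_params(dim, rng):
    """Hypothetical random projection weights for the two directions."""
    def make():
        return tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    return {"video_from_audio": make(), "audio_from_video": make()}

rng = np.random.default_rng(0)
dim = 8
video_tokens = rng.normal(size=(4, dim))    # toy: 4 video tokens
audio_tokens = rng.normal(size=(10, dim))   # toy: 10 audio tokens
params = init_params(dim, rng)
video_out, audio_out = fusion_block(video_tokens, audio_tokens, params)
print(video_out.shape, audio_out.shape)
```

In this sketch the rotary positions carry the timing signal, since attention scores depend on the relative offset between an audio token and a video token on a shared timeline, while the cross-attention values carry the semantic content from one stream into the other. In a twin-tower design such a block would sit inside each pair of corresponding transformer blocks, but how the actual model interleaves it with other layers is not specified in the abstract.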