

Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

September 30, 2025
Authors: Chetwin Low, Weimin Wang, Calder Katyal
cs.AI

Abstract

Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code, and model weights are published at https://aaxwaz.github.io/Ovi.
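To make the fusion mechanism concrete, below is a minimal PyTorch sketch of one blockwise fusion step between twin towers: each stream runs self-attention under its own rotary position embedding, with a scale factor stretching positions onto a shared timeline (the timing exchange), followed by bidirectional cross-attention so each modality attends to the other (the semantic exchange). This is an illustrative sketch, not the authors' implementation; the module names (`TwinFusionBlock`, `rope`), tensor shapes, and the RoPE scale ratio are all assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of one blockwise
# fusion step between twin DiT towers.
import torch
import torch.nn as nn


def rope(x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Apply rotary position embeddings. `scale` stretches positions so
    audio and video tokens land on a common timeline (a simplified stand-in
    for the paper's scaled-RoPE timing exchange)."""
    b, n, d = x.shape
    half = d // 2
    pos = torch.arange(n, device=x.device).float() * scale
    freqs = pos[:, None] / (10000 ** (torch.arange(half, device=x.device).float() / half))
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TwinFusionBlock(nn.Module):
    """One fused block: twin self-attention streams plus bidirectional
    cross-attention between the audio and video towers."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries audio
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries video
        self.v_norm, self.a_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # Timing exchange: rotate each stream onto a shared timeline.
        # The 2.0 : 1.0 scale ratio is purely illustrative.
        vq, aq = rope(v, scale=2.0), rope(a, scale=1.0)
        v = v + self.v_self(vq, vq, v, need_weights=False)[0]
        a = a + self.a_self(aq, aq, a, need_weights=False)[0]
        # Semantic exchange: bidirectional cross-attention between towers.
        v = v + self.v2a(self.v_norm(v), a, a, need_weights=False)[0]
        a = a + self.a2v(self.a_norm(a), v, v, need_weights=False)[0]
        return v, a


if __name__ == "__main__":
    block = TwinFusionBlock()
    video = torch.randn(1, 64, 512)   # 64 video tokens
    audio = torch.randn(1, 128, 512)  # 128 audio tokens
    v, a = block(video, audio)
    print(v.shape, a.shape)  # torch.Size([1, 64, 512]) torch.Size([1, 128, 512])
```

A full model would stack many such blocks, so timing and semantics are exchanged repeatedly at every depth rather than aligned once after the fact, which is what lets the approach avoid separate pipelines or post hoc alignment.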