Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

September 30, 2025
Authors: Chetwin Low, Weimin Wang, Calder Katyal
cs.AI

Abstract

Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All demos, code, and model weights are published at https://aaxwaz.github.io/Ovi.
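
To make the blockwise fusion idea concrete, here is a minimal PyTorch sketch of one fusion step: twin per-modality blocks followed by bidirectional cross-attention, plus a standalone helper for RoPE angles on a scaled, shared time axis. This is an illustration under stated assumptions, not Ovi's released code: the names (`BlockwiseFusionBlock`, `scaled_rope_angles`, the scale factors) are hypothetical, plain `TransformerEncoderLayer`s stand in for full DiT blocks, and applying the rotation inside attention is omitted for brevity.

```python
import torch
import torch.nn as nn


def scaled_rope_angles(num_tokens: int, dim: int, scale: float) -> torch.Tensor:
    """Rotation angles for RoPE with positions scaled onto a shared time axis.

    `scale` maps token indices to a common unit (e.g. seconds), so audio and
    video tokens occurring at the same real time get matching phases even
    though the two modalities have different token rates. Hypothetical
    helper; the paper only states that timing is exchanged via scaled RoPE.
    """
    pos = torch.arange(num_tokens, dtype=torch.float32) * scale
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(pos, inv_freq)  # shape: (num_tokens, dim // 2)


class BlockwiseFusionBlock(nn.Module):
    """One audio/video block pair with bidirectional cross-attention fusion.

    Each modality runs its own block (a stand-in for a full DiT block),
    then queries the other modality's hidden states; residual connections
    keep each tower's stream intact.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.video_block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.audio_block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_from_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # Per-modality processing inside each tower.
        v = self.video_block(v)
        a = self.audio_block(a)
        # Blockwise semantic exchange: each tower attends to the other.
        v = v + self.video_from_audio(v, a, a, need_weights=False)[0]
        a = a + self.audio_from_video(a, v, v, need_weights=False)[0]
        return v, a


# Toy usage: a 5 s clip with (made-up) 40 video tokens and 250 audio tokens;
# RoPE positions are scaled so both sequences live on the same time axis.
if __name__ == "__main__":
    block = BlockwiseFusionBlock(dim=64)
    v = torch.randn(1, 40, 64)   # (batch, video tokens, dim)
    a = torch.randn(1, 250, 64)  # (batch, audio tokens, dim)
    v_angles = scaled_rope_angles(40, 64, scale=5.0 / 40)    # seconds per video token
    a_angles = scaled_rope_angles(250, 64, scale=5.0 / 250)  # seconds per audio token
    v_out, a_out = block(v, a)
    print(v_out.shape, a_out.shape)  # torch.Size([1, 40, 64]) torch.Size([1, 250, 64])
```

The point of scaling the positions rather than the architecture is that the two towers can stay identical (as the abstract describes) while still agreeing on when events happen, since cross-attention then matches tokens whose scaled positions coincide.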