Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Abstract

Audio-video generation has often relied on complex multi-stage architectures or on synthesizing sound and visuals sequentially. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. Through blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower whose architecture is identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers on a vast video corpus, with the towers exchanging timing (via scaled-RoPE embeddings) and semantics (via bidirectional cross-attention) at every block. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All demos, code, and model weights are published at https://aaxwaz.github.io/Ovi
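
The blockwise exchange described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the token counts, the feature dimension, the absence of learned projections, and the rule of scaling audio positions by the ratio of video to audio token counts are all assumptions made for illustration. The function names (`apply_rope`, `cross_attend`, `fused_block`) are hypothetical.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (tokens, dim) by position-dependent angles."""
    dim = x.shape[1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per feature pair
    angles = positions[:, None] * freqs[None, :]    # (tokens, dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = even * cos - odd * sin
    out[:, 1::2] = even * sin + odd * cos
    return out

def cross_attend(queries, context, q_pos, c_pos):
    """Single-head cross-attention; learned projections omitted (assumption)."""
    q = apply_rope(queries, q_pos)
    k = apply_rope(context, c_pos)
    scores = q @ k.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ context

def fused_block(video, audio):
    """One fusion step: each stream attends to the other (bidirectional)."""
    t_v, t_a = len(video), len(audio)
    v_pos = np.arange(t_v, dtype=float)
    # Assumed scaling rule: map audio positions onto the video timeline
    # so that tokens covering the same moment receive similar rotations.
    a_pos = np.arange(t_a, dtype=float) * (t_v / t_a)
    video_out = video + cross_attend(video, audio, v_pos, a_pos)
    audio_out = audio + cross_attend(audio, video, a_pos, v_pos)
    return video_out, audio_out

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))    # 8 video tokens (hypothetical sizes)
audio = rng.standard_normal((40, 16))   # 40 audio tokens over the same clip
video_out, audio_out = fused_block(video, audio)
print(video_out.shape, audio_out.shape)
```

Because the two streams have different token rates, an unscaled position index would make audio token 8 and video token 8 look simultaneous even though they fall at different moments. Rescaling the audio positions puts both streams on a shared timeline before attention scores are computed.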
