Native Audio-Visual Aligned Generation

Abstract

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio, and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning, which associates reference timbre cues with their corresponding speech spans to enable controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability with only 6.3B parameters.
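As a rough illustration of the align-then-fuse ordering described above, the toy sketch below runs two stages on small token sequences: a modality-aware stage in which audio and video tokens use separate projections and attend only to each other, followed by a modality-shared stage in which a single projection is applied and external context is injected. It is not NAVA's implementation. The single-head attention, the use of cross-attention for context injection, the residual connections, and every name (`align_then_fuse`, `w_video`, `w_audio`, `w_shared`) are assumptions made only for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(x * y for x, y in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out

def random_matrix(rng, rows, cols):
    """Small random matrix, used here for both toy weights and toy tokens."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]

def align_then_fuse(video, audio, context, params):
    """Hypothetical two-stage block: align audio and video, then fuse with context."""
    # Stage 1 (modality-aware): each modality gets its own projection, and
    # self-attention runs over the audio-video tokens only (no context yet).
    v = matmul(video, params["w_video"])
    a = matmul(audio, params["w_audio"])
    av = v + a  # list concatenation along the sequence axis
    av = add(av, attention(av, av, av))
    # Stage 2 (modality-shared): one projection for all tokens, with context
    # injected by cross-attention (an assumed conditioning mechanism).
    h = matmul(av, params["w_shared"])
    h = add(h, attention(h, context, context))
    return h[:len(video)], h[len(video):]

rng = random.Random(0)
dim = 8
params = {name: random_matrix(rng, dim, dim)
          for name in ("w_video", "w_audio", "w_shared")}
video = random_matrix(rng, 6, dim)    # 6 toy video tokens
audio = random_matrix(rng, 4, dim)    # 4 toy audio tokens
context = random_matrix(rng, 3, dim)  # 3 toy context (e.g. text) tokens
video_out, audio_out = align_then_fuse(video, audio, context, params)
print(len(video_out), len(audio_out), len(video_out[0]))  # prints "6 4 8"
```

In this sketch the context tokens never enter the first stage, so audio-video correspondence is formed before any semantic conditioning is applied, which is the ordering the abstract describes. How NAVA actually injects context and where it switches between the two stages is not specified here.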