Native Audio-Visual Alignment for Generation

Abstract

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual and acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio, and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning, which associates reference timbre cues with their corresponding speech spans to enable controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability with only 6.3B parameters.
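The two-stage ordering described above (modality-aware alignment first, then modality-shared denoising conditioned on external context) can be illustrated with a toy sketch. This is not the NAVA implementation: the block structure, the use of cross-attention to inject context, the residual connections, and every name below (`align_block`, `fuse_block`, `align_then_fuse`, `make_params`) are assumptions made purely for illustration, and the abstract does not specify any of them.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(x * y for x, y in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def rand_matrix(rng, rows, cols):
    """Small random matrix, used for toy weights and toy tokens."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_params(rng, dim):
    """Hypothetical parameter layout: three (Q, K, V) matrices per role."""
    roles = ("audio", "video", "shared", "context")
    return {r: [rand_matrix(rng, dim, dim) for _ in range(3)] for r in roles}

def align_block(audio, video, w_audio, w_video):
    """Assumed modality-aware stage: each modality has its own Q/K/V weights,
    but attention runs jointly over audio and video tokens. No context here."""
    qa, ka, va = (matmul(audio, w) for w in w_audio)
    qv, kv, vv = (matmul(video, w) for w in w_video)
    mixed = attention(qa + qv, ka + kv, va + vv)  # list concat = token axis
    n = len(audio)
    return add(audio, mixed[:n]), add(video, mixed[n:])

def fuse_block(tokens, context, w_shared, w_context):
    """Assumed modality-shared stage: one weight set for all audio-video
    tokens, followed by cross-attention to the external context."""
    q, k, v = (matmul(tokens, w) for w in w_shared)
    tokens = add(tokens, attention(q, k, v))
    wq, wk, wv = w_context
    cross = attention(matmul(tokens, wq), matmul(context, wk), matmul(context, wv))
    return add(tokens, cross)

def align_then_fuse(audio, video, context, params):
    """Align audio and video first, then denoise jointly under context."""
    audio, video = align_block(audio, video, params["audio"], params["video"])
    return fuse_block(audio + video, context, params["shared"], params["context"])

rng = random.Random(0)
dim = 4
params = make_params(rng, dim)
audio = rand_matrix(rng, 3, dim)    # 3 toy audio tokens
video = rand_matrix(rng, 5, dim)    # 5 toy video tokens
context = rand_matrix(rng, 2, dim)  # 2 toy context tokens (e.g. text)
fused = align_then_fuse(audio, video, context, params)
print(len(fused), len(fused[0]))
```

In this sketch the context tokens never enter `align_block`, which mirrors the stated idea of establishing audio-video correspondence in a dedicated interaction space before semantic conditioning is applied.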