

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

April 26, 2026
Authors: Zhen Ye, Xu Tan, Aoxiong Yin, Hongzhan Lin, Guangyan Zhang, Peiwen Sun, Yiming Li, Chi-Min Chan, Wei Ye, Shikun Zhang, Wei Xue
cs.AI

Abstract

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
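The core design, a shared autoregressive backbone for high-level cross-modal reasoning feeding two lightweight modality-specific decoder heads, can be sketched in miniature. The snippet below is a toy illustration only: the real Talker-T2AV uses a language-model backbone and diffusion transformer heads, whereas here the backbone step and both heads are stand-in linear maps, and all dimensions (`D_HIDDEN`, `D_AUDIO`, `D_VIDEO`, `N_BLOCKS`) are hypothetical values chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): shared hidden width, and the
# latent widths the two modality-specific heads decode into.
D_HIDDEN, D_AUDIO, D_VIDEO, N_BLOCKS = 16, 8, 12, 4

# Stand-in for the shared autoregressive backbone: a fixed linear map
# plus tanh advances the hidden state block by block.
W_backbone = rng.standard_normal((D_HIDDEN, D_HIDDEN)) / np.sqrt(D_HIDDEN)

# Stand-ins for the two lightweight heads. In the paper these are
# diffusion transformer heads; here each is a single linear projection.
W_audio = rng.standard_normal((D_HIDDEN, D_AUDIO)) / np.sqrt(D_HIDDEN)
W_video = rng.standard_normal((D_HIDDEN, D_VIDEO)) / np.sqrt(D_HIDDEN)


def generate(n_blocks: int):
    """Autoregressively produce hidden states block by block, then decode
    each block into per-modality latents with separate heads."""
    h = rng.standard_normal(D_HIDDEN)  # initial state (illustrative)
    audio_latents, video_latents = [], []
    for _ in range(n_blocks):
        h = np.tanh(h @ W_backbone)        # shared cross-modal reasoning
        audio_latents.append(h @ W_audio)  # audio-specific decoding
        video_latents.append(h @ W_video)  # video-specific decoding
    return np.stack(audio_latents), np.stack(video_latents)


audio, video = generate(N_BLOCKS)
print(audio.shape, video.shape)  # (4, 8) (4, 12)
```

The point of the factorization is visible even in this toy: the loop body that couples modalities runs once per block in a single shared state, while the per-modality projections never see each other's low-level outputs, mirroring the paper's separation of shared semantics from modality-specific rendering.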