Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
April 26, 2026
Authors: Zhen Ye, Xu Tan, Aoxiong Yin, Hongzhan Lin, Guangyan Zhang, Peiwen Sun, Yiming Li, Chi-Min Chan, Wei Ye, Shikun Zhang, Wei Xue
cs.AI
Abstract
Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple the modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking-head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework in which high-level cross-modal modeling occurs in a shared backbone while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space, and two lightweight diffusion-transformer heads decode its hidden states into frame-level audio and video latents. Experiments on talking-portrait benchmarks show that Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
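To make the described layout concrete, below is a minimal PyTorch sketch of the architecture the abstract outlines: a shared autoregressive backbone over a unified audio-video token sequence, with one lightweight diffusion-transformer head per modality decoding the backbone's hidden states into frame-level latents. All module names, sizes, layer counts, and the additive conditioning scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DiffusionHead(nn.Module):
    """Lightweight diffusion-transformer head (hypothetical): denoises
    frame-level latents conditioned on shared-backbone hidden states."""

    def __init__(self, latent_dim, hidden_dim, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden_dim)
        self.cond_proj = nn.Linear(hidden_dim, hidden_dim)
        self.time_emb = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(hidden_dim, latent_dim)

    def forward(self, noisy_latents, t, backbone_hidden):
        # noisy_latents: (B, T, latent_dim); t: (B, 1) diffusion timestep;
        # backbone_hidden: (B, T, hidden_dim). Conditioning is a simple sum
        # here; the paper's actual scheme may differ.
        h = (
            self.in_proj(noisy_latents)
            + self.cond_proj(backbone_hidden)
            + self.time_emb(t).unsqueeze(1)
        )
        return self.out_proj(self.blocks(h))  # predicted noise (or velocity)


class TalkerT2AVSketch(nn.Module):
    """Shared AR backbone for high-level cross-modal reasoning, plus one
    modality-specific diffusion head for low-level refinement."""

    def __init__(self, hidden_dim=512, audio_dim=64, video_dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.audio_head = DiffusionHead(audio_dim, hidden_dim)
        self.video_head = DiffusionHead(video_dim, hidden_dim)

    def forward(self, tokens, noisy_audio, noisy_video, t):
        # tokens: (B, T, hidden_dim) embeddings of the unified audio-video
        # token sequence, one position per chunk/patch.
        T = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(tokens, mask=causal)  # autoregressive hidden states
        eps_audio = self.audio_head(noisy_audio, t, h)
        eps_video = self.video_head(noisy_video, t, h)
        return eps_audio, eps_video


# Smoke test with random tensors.
model = TalkerT2AVSketch()
B, T = 2, 16
out_a, out_v = model(
    torch.randn(B, T, 512),   # unified token embeddings
    torch.randn(B, T, 64),    # noisy audio latents
    torch.randn(B, T, 256),   # noisy video latents
    torch.rand(B, 1),         # diffusion timesteps
)
print(out_a.shape, out_v.shape)  # (2, 16, 64) (2, 16, 256)
```

The division of labor is the point of the sketch: cross-modal interaction happens once, in the shared backbone's attention, while each head only ever sees its own modality's latents plus the backbone's hidden states, matching the abstract's separation of high-level joint reasoning from low-level modality-specific rendering.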