Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Abstract

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple the modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: although audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling at every level causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework in which high-level cross-modal modeling takes place in a shared backbone, while low-level refinement is handled by modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space, and two lightweight diffusion transformer heads decode its hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show that Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, and achieves stronger cross-modal consistency than cascaded pipelines.
