DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Abstract

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation, followed by photorealistic rendering, which limits both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first uses DiTaiListener-Gen to generate short segments of listener responses conditioned on the speaker's speech and facial motions. It then uses DiTaiListener-Edit to refine the transitional frames so that the segments join seamlessly. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) to the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) that processes the speaker's auditory and visual cues. CTM-Adapter integrates the speaker's input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses the short video segments produced by DiTaiListener-Gen into smooth, continuous videos, ensuring temporal consistency in facial expressions and image quality across the merged segments. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener: the model is the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
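
The two structural ideas in the abstract, causal conditioning on the speaker and refinement of frames around segment boundaries, can be illustrated with a small sketch. This is not the authors' implementation: the function names causal_mask and transition_windows, the boolean-mask formulation, and the segment_len and margin parameters are all assumptions introduced here for illustration, and the abstract does not specify how CTM-Adapter or DiTaiListener-Edit actually realize these steps.

```python
# Illustrative sketch only; all names and parameters are hypothetical
# and are not taken from the DiTaiListener paper.

def causal_mask(num_frames):
    """Return a num_frames x num_frames boolean mask.

    mask[t][s] is True when listener frame t may use speaker frame s,
    which is the case only when s <= t (no access to future speaker cues).
    """
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def transition_windows(total_frames, segment_len, margin):
    """List (start, end) frame ranges, end exclusive, around each boundary
    between consecutive fixed-length segments. These are the frames a
    transition-refinement pass would revisit under this assumed scheme.
    """
    windows = []
    for boundary in range(segment_len, total_frames, segment_len):
        start = max(0, boundary - margin)
        end = min(total_frames, boundary + margin)
        windows.append((start, end))
    return windows


mask = causal_mask(4)
windows = transition_windows(total_frames=48, segment_len=16, margin=2)
```

Here the first listener frame can only use the first speaker frame, while the last can use all of them. With three assumed 16-frame segments, the two boundaries yield two short windows of frames to refine.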
