DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Abstract

Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, which limits both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames with DiTaiListener-Edit to produce seamless transitions. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) to the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) that processes the speaker's auditory and visual cues. CTM-Adapter integrates the speaker's input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. It fuses the short video segments produced by DiTaiListener-Gen into smooth, continuous videos while keeping facial expressions and image quality temporally consistent across segment boundaries. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener: the model is the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
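
The causal constraint described for CTM-Adapter can be illustrated with a toy example. The following is only a minimal sketch, not the authors' implementation: it assumes one speaker feature vector per listener frame and uses plain single-head attention in NumPy. The function name `causal_cross_attention` and all shapes are hypothetical. It shows only the idea that the listener frame at time t may draw on speaker features from times up to and including t, never from later ones.

```python
import numpy as np


def causal_cross_attention(listener_q, speaker_kv):
    """Toy causal cross-attention (hypothetical, for illustration only).

    listener_q: (T, d) query features, one per listener frame.
    speaker_kv: (T, d) speaker features, one per time step.
    Listener frame t attends only to speaker steps 0..t.
    Returns the attended features (T, d) and the weights (T, T).
    """
    T, d = listener_q.shape
    scores = listener_q @ speaker_kv.T / np.sqrt(d)  # (T, T) similarity
    # Lower-triangular mask: row t keeps columns 0..t, hides the future.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the allowed speaker steps.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ speaker_kv, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))   # 5 listener frames, 8-dim features
kv = rng.normal(size=(5, 8))  # 5 speaker steps, 8-dim features
out, w = causal_cross_attention(q, kv)
```

Under these assumptions, changing the speaker features at later time steps leaves the outputs for earlier listener frames unchanged, which is the property that a causal conditioning scheme relies on.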
