SkyReels-Audio: Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Abstract

The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
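
As an illustration only, two of the inference-time ideas named in the abstract, audio-guided classifier-free guidance and sliding-window fusion of latents across temporal segments, can be sketched in plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the function names `audio_cfg` and `fuse_windows`, the guidance scale, the uniform averaging of overlapping frames, and the use of one scalar per frame in place of a latent tensor are all hypothetical choices made here for clarity.

```python
from typing import List, Sequence


def audio_cfg(uncond: Sequence[float],
              audio_cond: Sequence[float],
              scale: float) -> List[float]:
    """Standard classifier-free guidance combination (hypothetical helper).

    Starts from the prediction made without audio and moves toward the
    audio-conditioned prediction; scale > 1 extrapolates past it.
    """
    if len(uncond) != len(audio_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, audio_cond)]


def fuse_windows(windows: List[List[float]], stride: int) -> List[float]:
    """Fuse per-window latents into one sequence by averaging overlaps.

    Assumption: window i covers frames [i * stride, i * stride + window_len).
    Each frame is a single float here; a real latent would be a tensor.
    """
    if not windows:
        return []
    window_len = len(windows[0])
    if any(len(w) != window_len for w in windows):
        raise ValueError("all windows must have the same length")
    if stride < 1 or stride > window_len:
        raise ValueError("stride must be in [1, window_len] to avoid gaps")

    total = stride * (len(windows) - 1) + window_len
    sums = [0.0] * total    # running sum of latents per frame
    counts = [0] * total    # number of windows covering each frame
    for i, window in enumerate(windows):
        start = i * stride
        for j, value in enumerate(window):
            sums[start + j] += value
            counts[start + j] += 1
    # Uniform average over every window that covers a frame.
    return [s / c for s, c in zip(sums, counts)]


if __name__ == "__main__":
    guided = audio_cfg([0.0, 1.0], [1.0, 3.0], scale=2.0)
    fused = fuse_windows([[1.0] * 4, [3.0] * 4], stride=2)
    print(guided)
    print(fused)
```

With a stride smaller than the window length, each overlapping frame receives contributions from more than one window, which is one simple way to soften seams between adjacent segments. The framework itself may weight the overlap differently.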