SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

June 1, 2025
Authors: Zhengcong Fei, Hao Jiang, Di Qiu, Baoxuan Gu, Youqiang Zhang, Jiahua Wang, Jialin Bai, Debang Li, Mingyuan Fan, Guibin Chen, Yahui Zhou
cs.AI

Abstract

The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
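
The abstract names audio-guided classifier-free guidance and sliding-window latent fusion but gives no formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the standard classifier-free guidance combination with audio as the condition, and it assumes overlapping windows are fused by uniform averaging; the paper may use a different weighting. The function names `audio_cfg` and `fuse_windows`, and the choice of one scalar latent per frame, are hypothetical and chosen for illustration.

```python
from typing import List


def audio_cfg(pred_uncond: List[float], pred_audio: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance combination (assumed form).

    pred_uncond: model prediction with the audio condition dropped.
    pred_audio:  model prediction with the audio condition present.
    scale:       guidance scale; 1.0 returns the audio-conditioned prediction.
    """
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_audio)]


def fuse_windows(windows: List[List[float]], stride: int, total: int) -> List[float]:
    """Fuse overlapping per-window latents into one sequence of `total` frames.

    Assumption: window k covers frames [k * stride, k * stride + len(window)),
    and frames covered by several windows are averaged uniformly.
    """
    acc = [0.0] * total      # running sum of latents per frame
    count = [0.0] * total    # number of windows covering each frame
    for k, window in enumerate(windows):
        start = k * stride
        for i, value in enumerate(window):
            t = start + i
            if t < total:
                acc[t] += value
                count[t] += 1.0
    if any(c == 0.0 for c in count):
        raise ValueError("some frames are not covered by any window")
    return [a / c for a, c in zip(acc, count)]


if __name__ == "__main__":
    # Two windows of 3 frames with stride 2 overlap on frame 2.
    print(fuse_windows([[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]], stride=2, total=5))
    print(audio_cfg([0.0, 1.0], [1.0, 3.0], scale=2.0))
```

In a real pipeline the per-frame values would be latent tensors and the fusion would run at each denoising step; the scalar version above only illustrates the bookkeeping for overlapping segments.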