SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Abstract

The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. In addition, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
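The abstract says only that sliding-window denoising fuses latent representations across temporal segments; it does not give the fusion rule. As a rough illustration of the general idea, the sketch below blends overlapping windows with a weighted average that tapers at each window's edges. Everything here is an assumption made for illustration: the function names (`window_starts`, `ramp_weights`, `fuse_windows`), the linear ramp weights, and the use of one scalar per frame in place of a real latent tensor. A real pipeline would presumably apply such a fusion at each denoising step rather than once.

```python
def window_starts(total, window, stride):
    """Start indices of windows covering [0, total); the last window ends at `total`."""
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window so that windows leave no gaps")
    if total <= window:
        return [0]
    starts = list(range(0, total - window, stride))
    starts.append(total - window)
    return starts


def ramp_weights(window, overlap):
    """Per-frame blend weights: ramp up at the start, flat in the middle, ramp down at the end."""
    weights = [1.0] * window
    for i in range(min(overlap, window // 2)):
        edge = (i + 1) / (overlap + 1)
        weights[i] = edge
        weights[window - 1 - i] = edge
    return weights


def fuse_windows(total, window, stride, denoise_window):
    """Weighted average of per-window outputs over a sequence of `total` frames.

    `denoise_window(start, end)` is a hypothetical callback that returns one
    value per frame in [start, end); it stands in for the model call.
    """
    weights = ramp_weights(window, window - stride)
    acc = [0.0] * total   # weighted sum of window outputs per frame
    norm = [0.0] * total  # sum of weights per frame
    for start in window_starts(total, window, stride):
        end = min(start + window, total)
        for i, value in enumerate(denoise_window(start, end)):
            acc[start + i] += weights[i] * value
            norm[start + i] += weights[i]
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    # Toy stand-in for a denoiser: each window returns its own frame indices.
    fused = fuse_windows(10, 4, 2, lambda s, e: [float(i) for i in range(s, e)])
    print(fused)
```

Because the fused value is a normalized weighted average, frames on which overlapping windows already agree are left unchanged, and only disagreements near window seams are smoothed.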