SkyReels-A2: Compose Anything in Video Diffusion Transformers
April 3, 2025
Authors: Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, Yahui Zhou
cs.AI
Abstract
This paper presents SkyReels-A2, a controllable video generation framework
capable of assembling arbitrary visual elements (e.g., characters, objects,
backgrounds) into synthesized videos based on textual prompts while maintaining
strict consistency with reference images for each element. We term this task
elements-to-video (E2V), whose primary challenges lie in preserving the
fidelity of each reference element, ensuring coherent composition of the scene,
and achieving natural outputs. To address these challenges, we first design a
comprehensive data pipeline to construct prompt-reference-video triplets for
model training. Next, we propose a novel image-text joint embedding model to
inject multi-element representations into the generative process, balancing
element-specific consistency with global coherence and text alignment. We also
optimize the inference pipeline for both speed and output stability. Moreover,
we introduce a carefully curated benchmark for systematic evaluation, i.e., A2
Bench. Experiments demonstrate that our framework can generate diverse,
high-quality videos with precise element control. SkyReels-A2 is the first
open-source, commercial-grade model for E2V generation, performing
favorably against advanced closed-source commercial models. We anticipate
SkyReels-A2 will advance creative applications such as drama and virtual
e-commerce, pushing the boundaries of controllable video generation.