

SkyReels-V3 Technical Report

January 24, 2026
Authors: Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou
cs.AI

Abstract

Video generation serves as a cornerstone for building world models, and multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference-images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference-images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization improves generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation through first-and-last-frame interpolation training and a restructured key-frame inference paradigm, optimizing audio-video synchronization while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics, including visual quality, instruction following, and task-specific measures, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.
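The abstract's key-frame inference paradigm for minute-level generation can be pictured as a chunked schedule: a long audio track is split into clips, anchor (key) frames sit at clip boundaries, and each clip is generated conditioned on its first and last anchor frames so adjacent clips share a boundary frame. The sketch below is a minimal illustration of that scheduling idea only; `plan_segments` and `keyframe_indices` are hypothetical names, not part of the SkyReels-V3 codebase, and the actual method may differ.

```python
# Illustrative sketch (hypothetical): schedule clips for first-and-last-frame
# conditioned generation, so a minute-long video is produced clip by clip.

def plan_segments(total_frames: int, clip_len: int) -> list[tuple[int, int]]:
    """Return (start, end) frame indices per clip; adjacent clips overlap
    by exactly one frame, the shared boundary keyframe."""
    if clip_len < 2:
        raise ValueError("clip_len must cover at least two anchor frames")
    segments = []
    start = 0
    while start < total_frames - 1:
        end = min(start + clip_len - 1, total_frames - 1)
        segments.append((start, end))
        start = end  # next clip reuses this keyframe as its first frame
    return segments

def keyframe_indices(segments: list[tuple[int, int]]) -> list[int]:
    """Anchor frames that would be synthesized before interpolating clips."""
    return [segments[0][0]] + [end for _, end in segments]
```

For example, 100 frames with 25-frame clips yields clips (0, 24), (24, 48), (48, 72), (72, 96), (96, 99); each clip's first frame is the previous clip's last, which is what keeps boundaries seamless under first-and-last-frame conditioning.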