

SkyReels-V3 Technical Report

January 24, 2026
Authors: Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou
cs.AI

Abstract

Video generation serves as a cornerstone for building world models, and multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference-images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference-images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization improves generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation through first-and-last-frame interpolation training and a restructured key-frame inference paradigm, optimizing audio-video synchronization while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics, including visual quality, instruction following, and task-specific measures, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.
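The abstract's key-frame inference paradigm for minute-level generation can be pictured as a chunked schedule: a long audio track is split into clips, anchor (key) frames sit at clip boundaries, and each clip is generated conditioned on its first and last anchor frames so adjacent clips share a boundary frame. The sketch below is a minimal illustration of that scheduling idea only; `plan_segments` and `keyframe_indices` are hypothetical names, not part of the SkyReels-V3 codebase, and the actual method may differ.

```python
# Illustrative sketch (hypothetical): schedule clips for first-and-last-frame
# conditioned generation, so a minute-long video is produced clip by clip.

def plan_segments(total_frames: int, clip_len: int) -> list[tuple[int, int]]:
    """Return (start, end) frame indices per clip; adjacent clips overlap
    by exactly one frame, the shared boundary keyframe."""
    if clip_len < 2:
        raise ValueError("clip_len must cover at least two anchor frames")
    segments = []
    start = 0
    while start < total_frames - 1:
        end = min(start + clip_len - 1, total_frames - 1)
        segments.append((start, end))
        start = end  # next clip reuses this keyframe as its first frame
    return segments

def keyframe_indices(segments: list[tuple[int, int]]) -> list[int]:
    """Anchor frames that would be synthesized before interpolating clips."""
    return [segments[0][0]] + [end for _, end in segments]
```

For example, 100 frames with 25-frame clips yields clips (0, 24), (24, 48), (48, 72), (72, 96), (96, 99); each clip's first frame is the previous clip's last, which is what keeps boundaries seamless under first-and-last-frame conditioning.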