# SkyReels-V3 Technical Report

Abstract

Video generation serves as a cornerstone for building world models, and multimodal contextual inference is a defining test of that capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that combines cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching that follows professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training on first-and-last frame insertion patterns and reconstructing the key-frame inference paradigm. Audio-video synchronization is optimized while preserving visual quality. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics, including visual quality, instruction following, and aspect-specific metrics, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3
