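SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Abstract

SkyReels V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture in which one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and that extends naturally to vision-referenced inpainting and editing via multi-modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.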
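The abstract does not spell out the channel-concatenation formulation, so the following is only a minimal sketch of how such an interface is commonly set up, not the SkyReels V4 implementation. It assumes a latent layout of (frames, channels, height, width) and a mask convention where 1 marks conditioning content to keep and 0 marks content to generate. The function names `build_mask` and `concat_condition`, the task labels, and the tensor sizes are all hypothetical. The point it illustrates is that the tasks differ only in the mask, while the model input is always built the same way.

```python
import numpy as np


def build_mask(task, frames, height, width, known_frames=1, edit_box=None):
    """Return a (frames, 1, height, width) mask: 1 = keep conditioning, 0 = generate.

    The task names and the mask convention are assumptions for illustration.
    """
    mask = np.zeros((frames, 1, height, width), dtype=np.float32)
    if task == "text_to_video":
        pass  # nothing is given, every frame is generated
    elif task == "image_to_video":
        mask[0] = 1.0  # only the first frame is given
    elif task == "video_extension":
        mask[:known_frames] = 1.0  # a leading clip is given
    elif task == "video_editing":
        top, left, bottom, right = edit_box
        mask[:] = 1.0  # keep everything ...
        mask[:, :, top:bottom, left:right] = 0.0  # ... except the edited region
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


def concat_condition(noisy_latent, reference, mask):
    """Concatenate [noisy latent | masked reference | mask] along the channel axis."""
    masked_reference = reference * mask  # mask broadcasts over the channel axis
    return np.concatenate([noisy_latent, masked_reference, mask], axis=1)


rng = np.random.default_rng(0)
T, C, H, W = 8, 4, 6, 6  # toy sizes, far smaller than real video latents
noisy = rng.standard_normal((T, C, H, W)).astype(np.float32)
reference = rng.standard_normal((T, C, H, W)).astype(np.float32)

for task in ("text_to_video", "image_to_video", "video_extension", "video_editing"):
    mask = build_mask(task, T, H, W, known_frames=3, edit_box=(1, 1, 4, 4))
    model_input = concat_condition(noisy, reference, mask)
    print(task, model_input.shape, int(mask.sum()))
```

In this sketch every task yields an input with the same channel count (2C + 1), which is what lets a single network serve generation, extension, and editing without task-specific input heads.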

