SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing Model

Abstract

SkyReels V4 is a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture, in which one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels V4 accepts rich multimodal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multimodal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and that naturally extends to vision-referenced inpainting and editing via multimodal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multimodal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
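To make the channel-concatenation idea concrete, here is a minimal toy sketch of how one mask-based interface could cover image-to-video, video extension, and video editing. The abstract does not specify the actual tensor layout, so everything here is an assumption for illustration: the names `build_mask` and `concat_condition`, the `[frame][channel][position]` layout, and the choice to concatenate noisy channels, masked reference channels, and the mask itself (a convention common in inpainting-style diffusion models, not confirmed for SkyReels V4).

```python
from typing import List, Sequence

# Toy stand-ins (assumptions, not the model's real tensors):
Latent = List[List[List[float]]]  # [frame][channel][position]
Mask = List[List[int]]            # [frame][position]; 1 = keep given content, 0 = generate


def build_mask(task: str, num_frames: int, num_positions: int,
               known_frames: int = 1, edit_positions: Sequence[int] = ()) -> Mask:
    """Express each task as a mask over (frame, position)."""
    mask = [[0] * num_positions for _ in range(num_frames)]
    if task == "text_to_video":
        pass  # nothing is given, every position is generated
    elif task == "image_to_video":
        mask[0] = [1] * num_positions  # only the first frame is given
    elif task == "video_extension":
        for t in range(min(known_frames, num_frames)):
            mask[t] = [1] * num_positions  # the leading clip is given
    elif task == "video_editing":
        edit = set(edit_positions)
        for t in range(num_frames):
            # keep everything except the region to be edited
            mask[t] = [0 if p in edit else 1 for p in range(num_positions)]
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


def concat_condition(noisy: Latent, reference: Latent, mask: Mask) -> Latent:
    """Per frame, concatenate noisy channels, masked reference channels, and the mask."""
    out = []
    for noisy_f, ref_f, mask_f in zip(noisy, reference, mask):
        # zero out reference content wherever the model must generate
        masked_ref = [[v * m for v, m in zip(channel, mask_f)] for channel in ref_f]
        out.append(noisy_f + masked_ref + [[float(m) for m in mask_f]])
    return out


# Tiny demo: 3 frames, 2 latent channels, 4 spatial positions per frame.
T, C, P = 3, 2, 4
noisy = [[[0.5] * P for _ in range(C)] for _ in range(T)]
reference = [[[1.0] * P for _ in range(C)] for _ in range(T)]
model_input = concat_condition(noisy, reference, build_mask("image_to_video", T, P))
print(len(model_input[0]))  # channels per frame: C + C + 1
```

Under these assumptions, switching tasks changes only the mask and the reference content, while the input shape seen by the video branch stays the same. That is the sense in which a single interface can serve several inpainting-style tasks.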
