SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing Model

Abstract

SkyReels-V4 is a unified multi-modal video foundation model for joint video-audio generation, inpainting, and editing. The model adopts a dual-stream Multimodal Diffusion Transformer (MMDiT) architecture in which one branch synthesizes video and the other generates temporally aligned audio, while both share a powerful text encoder based on a Multimodal Large Language Model (MMLM). SkyReels-V4 accepts rich multi-modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLM's multi-modal instruction-following capability with in-context learning in the video-branch MMDiT, the model can inject fine-grained visual guidance under complex conditioning, while the audio-branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel-concatenation formulation that unifies a wide range of inpainting-style tasks, such as image-to-video, video extension, and video editing, under a single interface, and that naturally extends to vision-referenced inpainting and editing via multi-modal prompts. SkyReels-V4 supports up to 1080p resolution, 32 FPS, and 15-second duration, enabling high-fidelity, multi-shot, cinema-level video generation with synchronized audio. To make such high-resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: joint generation of low-resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame-interpolation models. To our knowledge, SkyReels-V4 is the first video foundation model that simultaneously supports multi-modal input, joint video-audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
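
The channel-concatenation formulation can be illustrated with a small sketch. This is not the authors' implementation: the function names `frame_mask` and `concat_channels`, the mask convention (1 = frame to be generated, 0 = frame supplied as conditioning), and the use of plain Python lists in place of latent tensors are all assumptions made for illustration. Spatial masks, as needed for region-level video editing, are omitted; only per-frame masks are shown.

```python
from typing import List


def frame_mask(task: str, num_frames: int, known: int = 1) -> List[int]:
    """Per-frame mask: 1 = frame to generate, 0 = frame given as conditioning.

    The task names and the mask convention are hypothetical.
    """
    if task == "text_to_video":
        # Nothing is given: every frame is generated.
        return [1] * num_frames
    if task == "image_to_video":
        # Only the first frame is given.
        return [0] + [1] * (num_frames - 1)
    if task == "video_extension":
        # The first `known` frames are given, the rest are generated.
        return [0] * known + [1] * (num_frames - known)
    raise ValueError(f"unknown task: {task}")


def concat_channels(
    noisy: List[List[float]],
    cond: List[List[float]],
    mask: List[int],
) -> List[List[float]]:
    """Concatenate, per frame, noisy channels + masked conditioning + mask channel."""
    out = []
    for z, c, m in zip(noisy, cond, mask):
        # Zero the conditioning wherever the frame must be generated.
        kept = [v * (1 - m) for v in c]
        out.append(list(z) + kept + [float(m)])
    return out


if __name__ == "__main__":
    # At the stated maximum of 32 FPS and 15 seconds, a clip has 32 * 15 frames.
    total_frames = 32 * 15
    mask = frame_mask("image_to_video", total_frames)
    print(total_frames, sum(mask))  # frames in the clip, frames to generate
```

Under these assumptions the three tasks differ only in the mask passed in, so a single model input layout covers all of them; that is the sense in which the formulation puts the tasks under one interface.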