LongCat-Video Technical Report

October 25, 2025
Authors: Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
cs.AI

Abstract

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
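The abstract attributes part of the inference efficiency, especially at high resolutions, to Block Sparse Attention. The sketch below is a minimal, self-contained illustration of the general idea of block-sparse attention, assuming mean-pooled block summaries and a simple top-k block-selection rule; the function name, block size, and keep ratio are illustrative assumptions and do not reflect LongCat-Video's actual implementation.

```python
# Minimal sketch of block-sparse attention: each query block attends only to a
# small top-k subset of key/value blocks, reducing the quadratic attention cost.
# All names and the block-selection rule are illustrative, not the paper's code.
import torch

def block_sparse_attention(q, k, v, block_size=64, keep_ratio=0.25):
    """q, k, v: [batch, heads, seq_len, dim]; seq_len assumed divisible by block_size."""
    B, H, L, D = q.shape
    nb = L // block_size
    # Reshape sequences into blocks: [B, H, nb, block_size, D]
    qb = q.view(B, H, nb, block_size, D)
    kb = k.view(B, H, nb, block_size, D)
    vb = v.view(B, H, nb, block_size, D)
    # Coarse block-level relevance scores from mean-pooled queries/keys: [B, H, nb, nb]
    block_scores = torch.einsum("bhqd,bhkd->bhqk", qb.mean(3), kb.mean(3))
    keep = max(1, int(nb * keep_ratio))
    topk = block_scores.topk(keep, dim=-1).indices          # [B, H, nb, keep]
    out = torch.zeros_like(qb)
    for qi in range(nb):
        # Gather only the selected key/value blocks for this query block.
        idx = topk[:, :, qi]                                 # [B, H, keep]
        idx_exp = idx[..., None, None].expand(B, H, keep, block_size, D)
        k_sel = torch.gather(kb, 2, idx_exp).reshape(B, H, keep * block_size, D)
        v_sel = torch.gather(vb, 2, idx_exp).reshape(B, H, keep * block_size, D)
        attn = torch.softmax(qb[:, :, qi] @ k_sel.transpose(-1, -2) / D**0.5, dim=-1)
        out[:, :, qi] = attn @ v_sel
    return out.view(B, H, L, D)

# Toy usage: 1024 tokens, 8 heads; each query block attends to 25% of key blocks.
q = torch.randn(1, 8, 1024, 64)
print(block_sparse_attention(q, q, q).shape)  # torch.Size([1, 8, 1024, 64])
```

In this toy setting each query block attends to only a quarter of the key/value blocks, so attention FLOPs scale roughly with that keep ratio; the actual gains reported in the paper depend on its specific sparsity pattern and kernel implementation.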