LongCat-Video Technical Report

October 25, 2025
Authors: Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
cs.AI

Abstract

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
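The abstract attributes part of the inference efficiency, especially at high resolutions, to Block Sparse Attention. The sketch below is a minimal, self-contained illustration of the general idea of block-sparse attention, assuming mean-pooled block summaries and a simple top-k block-selection rule; the function name, block size, and keep ratio are illustrative assumptions and do not reflect LongCat-Video's actual implementation.

```python
# Minimal sketch of block-sparse attention: each query block attends only to a
# small top-k subset of key/value blocks, reducing the quadratic attention cost.
# All names and the block-selection rule are illustrative, not the paper's code.
import torch

def block_sparse_attention(q, k, v, block_size=64, keep_ratio=0.25):
    """q, k, v: [batch, heads, seq_len, dim]; seq_len assumed divisible by block_size."""
    B, H, L, D = q.shape
    nb = L // block_size
    # Reshape sequences into blocks: [B, H, nb, block_size, D]
    qb = q.view(B, H, nb, block_size, D)
    kb = k.view(B, H, nb, block_size, D)
    vb = v.view(B, H, nb, block_size, D)
    # Coarse block-level relevance scores from mean-pooled queries/keys: [B, H, nb, nb]
    block_scores = torch.einsum("bhqd,bhkd->bhqk", qb.mean(3), kb.mean(3))
    keep = max(1, int(nb * keep_ratio))
    topk = block_scores.topk(keep, dim=-1).indices          # [B, H, nb, keep]
    out = torch.zeros_like(qb)
    for qi in range(nb):
        # Gather only the selected key/value blocks for this query block.
        idx = topk[:, :, qi]                                 # [B, H, keep]
        idx_exp = idx[..., None, None].expand(B, H, keep, block_size, D)
        k_sel = torch.gather(kb, 2, idx_exp).reshape(B, H, keep * block_size, D)
        v_sel = torch.gather(vb, 2, idx_exp).reshape(B, H, keep * block_size, D)
        attn = torch.softmax(qb[:, :, qi] @ k_sel.transpose(-1, -2) / D**0.5, dim=-1)
        out[:, :, qi] = attn @ v_sel
    return out.view(B, H, L, D)

# Toy usage: 1024 tokens, 8 heads; each query block attends to 25% of key blocks.
q = torch.randn(1, 8, 1024, 64)
print(block_sparse_attention(q, q, q).shape)  # torch.Size([1, 8, 1024, 64])
```

In this toy setting each query block attends to only a quarter of the key/value blocks, so attention FLOPs scale roughly with that keep ratio; the actual gains reported in the paper depend on its specific sparsity pattern and kernel implementation.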