LongCat-Video Technical Report
October 25, 2025
Authors: Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
cs.AI
Abstract
Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters that delivers strong performance across multiple video generation tasks. It particularly excels in efficient, high-quality long video generation, representing our first step toward world models. Key features include:

- Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.
- Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence when generating minutes-long videos.
- Efficient inference: LongCat-Video generates 720p, 30 fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further improves efficiency, particularly at high resolutions.
- Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models.

Code and model weights are publicly available to accelerate progress in the field.
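To make the coarse-to-fine idea concrete: most denoising work can be spent on a temporally and spatially downsampled latent, which is then upsampled and refined with only a few full-resolution steps. The sketch below is a hypothetical illustration of that schedule (the report does not specify the sampler, latent shapes, or step counts; `denoise`, the 2x scale factors, and the step budget are all assumptions).

```python
import numpy as np

def upsample_video(lat, t_scale=2, s_scale=2):
    """Nearest-neighbor upsampling of a video latent (T, H, W, C) along
    time and space - a stand-in for initializing the refinement stage."""
    lat = np.repeat(lat, t_scale, axis=0)   # temporal axis
    lat = np.repeat(lat, s_scale, axis=1)   # height
    lat = np.repeat(lat, s_scale, axis=2)   # width
    return lat

def coarse_to_fine(denoise, shape=(8, 16, 16, 4), coarse_steps=30, fine_steps=10):
    """Hypothetical coarse-to-fine sampler: spend most denoising steps on a
    half-resolution video, then a few steps refining the upsampled result."""
    T, H, W, C = shape
    coarse = np.random.randn(T // 2, H // 2, W // 2, C)
    coarse = denoise(coarse, steps=coarse_steps)   # cheap: ~1/8 the tokens
    fine = upsample_video(coarse)                  # init full res from coarse
    return denoise(fine, steps=fine_steps)         # few expensive full-res steps
```

Because attention cost grows superlinearly with token count, shifting most steps to the half-resolution stage (1/8 the tokens) is where the inference speedup comes from.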
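Block Sparse Attention restricts each query block to a subset of key blocks instead of attending over all tokens. The report does not describe its exact kernel, so the following is a minimal NumPy sketch of one common scheme (pool each block to a single vector, score block pairs, and keep only the top-k key blocks per query block); the block size, pooling, and top-k selection here are assumptions for illustration.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Toy block-sparse attention over (T, d) arrays: each query block
    attends only to the `keep` key blocks with the highest pooled affinity."""
    T, d = q.shape
    nb = T // block
    # Mean-pool queries/keys to one vector per block for cheap block scoring.
    qb = q.reshape(nb, block, d).mean(axis=1)           # (nb, d)
    kb = k.reshape(nb, block, d).mean(axis=1)           # (nb, d)
    scores = qb @ kb.T                                  # (nb, nb) block affinities
    out = np.zeros_like(q)
    for i in range(nb):
        # Indices of the top-`keep` key blocks for query block i.
        top = np.argsort(scores[i])[-keep:]
        idx = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in top]
        )
        qi = q[i * block:(i + 1) * block]               # (block, d)
        att = qi @ k[idx].T / np.sqrt(d)                # (block, keep*block)
        att = np.exp(att - att.max(axis=1, keepdims=True))
        att /= att.sum(axis=1, keepdims=True)           # row-wise softmax
        out[i * block:(i + 1) * block] = att @ v[idx]
    return out
```

With `keep` equal to the total number of blocks this reduces to dense attention; shrinking `keep` trades a small approximation for compute that scales with `keep * block` rather than `T` per query, which is why the savings grow with resolution.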