LongCat-Video Technical Report

Abstract

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: **Unified architecture for multiple tasks**: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model. **Long video generation**: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence when generating minutes-long videos. **Efficient inference**: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further improves efficiency, particularly at high resolutions. **Strong performance with multi-reward RLHF**: Multi-reward reinforcement learning from human feedback (RLHF) enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
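To make the idea of block-sparse attention concrete, here is a minimal single-head NumPy sketch. The abstract does not say how LongCat-Video selects which blocks to attend to, so this sketch assumes a generic top-k selection over mean-pooled block scores. The names `block_sparse_attention`, `block_size`, and `keep_blocks` are hypothetical and are not taken from the released code.

```python
import numpy as np


def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Single-head attention where each query block attends only to its
    `keep_blocks` highest-scoring key blocks (illustrative assumption,
    not the selection rule used by LongCat-Video)."""
    n, d = q.shape
    assert n % block_size == 0, "sequence length must be a multiple of block_size"
    nb = n // block_size
    assert 1 <= keep_blocks <= nb

    # Coarse block-level scores from mean-pooled queries and keys.
    q_blk = q.reshape(nb, block_size, d).mean(axis=1)
    k_blk = k.reshape(nb, block_size, d).mean(axis=1)
    blk_scores = q_blk @ k_blk.T  # shape (nb, nb)

    # For every query block, indices of the top-scoring key blocks.
    top = np.argsort(-blk_scores, axis=1)[:, :keep_blocks]

    out = np.empty((n, v.shape[1]), dtype=float)
    for i in range(nb):
        rows = slice(i * block_size, (i + 1) * block_size)
        cols = np.concatenate(
            [np.arange(j * block_size, (j + 1) * block_size) for j in top[i]]
        )
        # Dense softmax attention restricted to the selected key blocks.
        s = q[rows] @ k[cols].T / np.sqrt(d)
        s = s - s.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(s)
        w = w / w.sum(axis=1, keepdims=True)
        out[rows] = w @ v[cols]
    return out
```

In this sketch the fine-grained score computation covers only a `keep_blocks / nb` fraction of the full attention matrix, which is why the saving grows with sequence length and therefore with resolution. Setting `keep_blocks` equal to the number of blocks recovers ordinary dense attention.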
