LongCat-Video Technical Report

Abstract

Video generation is a critical pathway toward world models, and efficient long-video inference is a key capability on that path. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters that delivers strong performance across multiple video generation tasks. It particularly excels at efficient, high-quality long-video generation and represents our first step toward world models. Its key features are: (1) Unified architecture for multiple tasks: built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; (2) Long video generation: pretraining on the Video-Continuation task enables LongCat-Video to maintain high quality and temporal coherence when generating minutes-long videos; (3) Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by applying a coarse-to-fine generation strategy along both the temporal and spatial axes, and Block Sparse Attention further improves efficiency, particularly at high resolutions; (4) Strong performance with multi-reward RLHF: multi-reward RLHF training enables LongCat-Video to perform on par with the latest closed-source models and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
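The abstract credits Block Sparse Attention with extra efficiency at high resolutions but gives no algorithmic details at this point. Purely as an illustration of the general idea, the following is a minimal pure-Python sketch of one common block-sparse scheme. It assumes that the token sequence is split into fixed-size blocks and that each query block attends only to the `top_k` key/value blocks whose block-mean keys best match its block-mean query. The function names (`block_sparse_attention`, `select_blocks`, `dense_attention`) and the parameters `block_size` and `top_k` are hypothetical. They are not LongCat-Video's released code or API, and its actual selection rule may differ.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _mean_row(rows: Matrix) -> List[float]:
    # Column-wise mean of a block of rows: one summary vector per block.
    return [sum(col) / len(rows) for col in zip(*rows)]


def _attend(query: List[float], keys: Matrix, values: Matrix, scale: float) -> List[float]:
    # Numerically stable softmax over scaled q.k logits, then a weighted sum of value rows.
    logits = [_dot(query, key) * scale for key in keys]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [sum(w * row[c] for w, row in zip(weights, values)) / total
            for c in range(len(values[0]))]


def dense_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    # Reference: every query attends to every key.
    scale = 1.0 / math.sqrt(len(q[0]))
    return [_attend(row, k, v, scale) for row in q]


def select_blocks(q_block: Matrix, k_means: Matrix, top_k: int) -> List[int]:
    # Hypothetical routing rule: score each key block by the dot product of the
    # block-mean query and block-mean key, keep the top_k indices in ascending order.
    q_mean = _mean_row(q_block)
    scores = [_dot(q_mean, km) for km in k_means]
    ranked = sorted(range(len(k_means)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])


def block_sparse_attention(q: Matrix, k: Matrix, v: Matrix,
                           block_size: int, top_k: int) -> Matrix:
    # Self-attention sketch: q, k, v share one sequence length divisible by block_size.
    n = len(q)
    if len(k) != n or len(v) != n or n % block_size != 0:
        raise ValueError("q, k, v must share a length that is a multiple of block_size")
    num_blocks = n // block_size
    scale = 1.0 / math.sqrt(len(q[0]))

    def block(m: Matrix, j: int) -> Matrix:
        return m[j * block_size:(j + 1) * block_size]

    # One cheap summary vector per key block, computed once and shared by all query blocks.
    k_means = [_mean_row(block(k, j)) for j in range(num_blocks)]
    out: Matrix = []
    for i in range(num_blocks):
        q_block = block(q, i)
        chosen = select_blocks(q_block, k_means, top_k)
        # Gather only the selected blocks; all other keys are skipped entirely.
        keys = [row for j in chosen for row in block(k, j)]
        values = [row for j in chosen for row in block(v, j)]
        out.extend(_attend(row, keys, values, scale) for row in q_block)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    q = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    k = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    v = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
    sparse = block_sparse_attention(q, k, v, block_size=2, top_k=1)
    print(len(sparse), "rows; each query block attended to 1 of 4 key blocks")
```

In this sketch, each query scores only `top_k * block_size` keys instead of all `n`. The attention cost therefore shrinks roughly by the fraction `top_k / num_blocks`, and longer or higher-resolution token sequences benefit most. With `top_k` equal to the number of blocks, the sketch reduces to ordinary dense attention.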