Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Abstract

Autoregressive large language models (LLMs) have unified a wide range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal modifications. To inject spatiotemporal correlations into LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. We therefore propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 adopts a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and address it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training, together with a compatible inference-time masking policy that avoids quality degradation. Using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.
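
The token dependency strategy and temporal tube masking mentioned in the abstract can be illustrated with a small sketch. This is not the Lumos-1 implementation. The function names `frame_causal_mask` and `temporal_tube_mask`, the `mask_ratio` parameter, and the choice to sample one spatial pattern and repeat it across every frame are assumptions made here for illustration only.

```python
import random


def frame_causal_mask(num_frames, tokens_per_frame):
    """Build an attention mask; mask[q][k] is True if query q may attend to key k.

    Tokens in the same frame attend to each other (intra-frame bidirectionality),
    and tokens attend to every earlier frame but no later one (temporal causality).
    """
    total = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(total)]
    return [[frame_of[k] <= frame_of[q] for k in range(total)] for q in range(total)]


def temporal_tube_mask(num_frames, height, width, mask_ratio, rng):
    """Sample one spatial mask and repeat it for every frame, forming a "tube".

    True marks a masked position. mask_ratio is a hypothetical parameter giving
    the fraction of spatial positions to mask.
    """
    num_positions = height * width
    num_masked = round(mask_ratio * num_positions)
    masked = set(rng.sample(range(num_positions), num_masked))
    spatial = [[(r * width + c) in masked for c in range(width)] for r in range(height)]
    # Copy the same spatial pattern into each frame so the masked positions align in time.
    return [[row[:] for row in spatial] for _ in range(num_frames)]


if __name__ == "__main__":
    # 3 frames with 2 tokens each: rows print as 110000, 111100, 111111 (twice each).
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))

    tube = temporal_tube_mask(3, 2, 4, mask_ratio=0.5, rng=random.Random(0))
    print(tube[0] == tube[1] == tube[2])  # every frame shares the same pattern
```

In this sketch the attention mask is block lower-triangular at frame granularity, and the tube mask hides the same spatial positions in every frame, so a masked token cannot be recovered by simply copying it from the same location in a neighboring frame.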
