Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Abstract

Generating content-rich long videos on the scale of minutes is desirable but challenging. Autoregressive large language models (LLMs) have achieved great success in generating long, coherent token sequences in natural language processing, whereas the exploration of autoregressive LLMs for video generation has been limited to short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on these observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem in long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to reduce error accumulation during inference. Loong can be trained on 10-second videos and extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: https://epiphqny.github.io/Loong-video.
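
The abstract names two ingredients without giving formulas: a unified text-plus-video token sequence and a re-weighted training loss. The Python sketch below is only an illustration of how such pieces could fit together, not the authors' implementation. Everything specific in it is an assumption: the separator token `bov_id`, the function names, the grouping of video tokens into fixed-size frames, and the choice to give the first `num_early_frames` frames a larger weight `early_weight`.

```python
from typing import List, Sequence


def build_unified_sequence(
    text_tokens: Sequence[int],
    video_tokens: Sequence[int],
    bov_id: int,
) -> List[int]:
    """Concatenate text and video tokens into one next-token-prediction sequence.

    `bov_id` is a hypothetical "begin of video" separator, not taken from the paper.
    """
    return list(text_tokens) + [bov_id] + list(video_tokens)


def reweighted_loss(
    token_losses: Sequence[float],
    tokens_per_frame: int,
    num_early_frames: int,
    early_weight: float,
) -> float:
    """Weighted mean of per-token losses over the video tokens.

    Assumed scheme: tokens in the first `num_early_frames` frames get weight
    `early_weight`; all remaining tokens get weight 1.0.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    if not token_losses or len(token_losses) % tokens_per_frame != 0:
        raise ValueError("token_losses must be a non-empty whole number of frames")

    weighted_total = 0.0
    weight_sum = 0.0
    for index, loss in enumerate(token_losses):
        frame_index = index // tokens_per_frame
        weight = early_weight if frame_index < num_early_frames else 1.0
        weighted_total += weight * loss
        weight_sum += weight
    return weighted_total / weight_sum


if __name__ == "__main__":
    sequence = build_unified_sequence([1, 2], [10, 11, 12], bov_id=99)
    print(sequence)
    # Two frames of two tokens each; the first frame is up-weighted.
    print(reweighted_loss([2.0, 2.0, 1.0, 1.0], 2, 1, early_weight=3.0))
```

With `early_weight=1.0` the function reduces to a plain mean over all video tokens, so the weight is the single knob that shifts emphasis between frame groups in this sketch.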