Loong: Generating Minute-level Long Videos with Autoregressive Language Models
October 3, 2024
Authors: Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu
cs.AI
Abstract
It is desirable but challenging to generate content-rich long videos on the
scale of minutes. Autoregressive large language models (LLMs) have achieved
great success in generating coherent and long sequences of tokens in the domain
of natural language processing, while the exploration of autoregressive LLMs
for video generation is limited to generating short videos of several seconds.
In this work, we conduct a deep analysis of the challenges that prevent
autoregressive LLM-based video generators from generating long videos. Based on
the observations and analysis, we propose Loong, a new autoregressive LLM-based
video generator that can generate minute-long videos. Specifically, we model
the text tokens and video tokens as a unified sequence for autoregressive LLMs
and train the model from scratch. We propose progressive short-to-long training
with a loss re-weighting scheme to mitigate the loss imbalance problem for long
video training. We further investigate inference strategies, including video
token re-encoding and sampling strategies, to diminish error accumulation
during inference. Our proposed Loong can be trained on 10-second videos and be
extended to generate minute-level long videos conditioned on text prompts, as
demonstrated by the results. More samples are available at:
https://epiphqny.github.io/Loong-video.
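
The abstract names a loss re-weighting scheme for long video training but does not give its form. As a rough illustration only, the sketch below assumes that tokens from the earliest frames receive a larger weight than later ones in a weighted mean of per-token losses. The function name `reweighted_loss`, its parameters, and the choice to upweight early frames are all hypothetical and are not taken from the abstract.

```python
def reweighted_loss(token_losses, tokens_per_frame, early_frames=1, early_weight=2.0):
    """Weighted mean of per-token losses for one video-token sequence.

    Hypothetical scheme (an assumption, not the paper's stated formula):
    tokens in the first `early_frames` frames get weight `early_weight`,
    all remaining tokens get weight 1.0.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    if not token_losses:
        raise ValueError("token_losses must not be empty")

    weights = []
    for index in range(len(token_losses)):
        frame = index // tokens_per_frame  # frame that this token belongs to
        weights.append(early_weight if frame < early_frames else 1.0)

    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / sum(weights)


# Three frames of two tokens each; the first frame has the highest loss.
losses = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0]
print(reweighted_loss(losses, tokens_per_frame=2))                    # 2.5
print(reweighted_loss(losses, tokens_per_frame=2, early_weight=1.0))  # 2.0
```

With `early_weight=1.0` the function reduces to a plain mean, so the gap between the two printed values shows how much extra emphasis the hypothetical weighting places on the first frame.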