TokensGen: Harnessing Condensed Tokens for Long Video Generation

July 21, 2025
Authors: Wenqi Ouyang, Zeqi Xiao, Danni Yang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
cs.AI

Abstract

Generating consistent long videos is a complex challenge: while diffusion-based generative models produce visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, together with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and smoothing transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at https://vicky0522.github.io/tokensgen-webpage/.
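The data flow described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names `t2to`, `to2v`, `stitch`, and `generate_long_video`, their parameters, and the scalar "frames" are hypothetical stand-ins, not the authors' code or API. In particular, `stitch` is a plain linear cross-fade substituted for the adaptive FIFO-Diffusion step, which the abstract does not specify in detail.

```python
import random


def t2to(prompt, num_clips, token_dim=4, seed=0):
    """Hypothetical stand-in for T2To: return condensed tokens for ALL clips in one call."""
    rng = random.Random(seed)
    base = [rng.random() for _ in range(token_dim)]
    # Every clip's tokens derive from one shared draw, mimicking joint generation.
    return [[b + 0.01 * i for b in base] for i in range(num_clips)]


def to2v(prompt, tokens, frames_per_clip=8):
    """Hypothetical stand-in for To2V: one short clip from the text and that clip's tokens."""
    level = sum(tokens) / len(tokens)
    # A "frame" is a single float here; a real model would emit image tensors.
    return [level for _ in range(frames_per_clip)]


def stitch(clips, overlap=2):
    """Linear cross-fade over `overlap` frames (assumption: NOT FIFO-Diffusion)."""
    video = list(clips[0])
    for clip in clips[1:]:
        for k in range(overlap):
            w = (k + 1) / (overlap + 1)  # blend weight of the incoming clip
            idx = len(video) - overlap + k
            video[idx] = (1 - w) * video[idx] + w * clip[k]
        video.extend(clip[overlap:])
    return video


def generate_long_video(prompt, num_clips, frames_per_clip=8, overlap=2):
    tokens = t2to(prompt, num_clips)  # stage 2: all tokens at once
    clips = [to2v(prompt, tok, frames_per_clip) for tok in tokens]  # stage 1 model, per clip
    return stitch(clips, overlap)  # inference-time transition step


video = generate_long_video("a boat drifting down a river", num_clips=4)
print(len(video))
```

The point of the sketch is the ordering: the tokens for every clip are produced jointly before any clip is rendered, so each short-clip call is conditioned on a globally planned token sequence rather than only on its predecessor.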