TV2TV: A Unified Framework for Interleaved Language and Video Generation
December 4, 2025
作者: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
cs.AI
Abstract
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework that decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, improving the visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
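The abstract describes inference as alternating between next-token prediction (text) and flow-matching denoising (video frames), with the model itself deciding when to switch. Below is a minimal sketch of such an interleaved generation loop, assuming hypothetical interfaces `lm_tower`, `video_tower`, and control tokens `<begin_video>` / `<end_of_text>`; none of these names come from the paper, and the sketch only illustrates the alternation plus a simple Euler integration of a flow-matching ODE, not TV2TV's actual implementation.

```python
# Hedged sketch of TV2TV-style interleaved inference.
# `lm_tower` and `video_tower` are assumed, hypothetical interfaces standing in
# for the two towers of a Mixture-of-Transformers model.
import torch

BOV = "<begin_video>"   # hypothetical control token: switch to frame generation
EOT = "<end_of_text>"   # hypothetical control token: stop generation

@torch.no_grad()
def generate_interleaved(lm_tower, video_tower, prompt_ids,
                         max_steps=256, num_flow_steps=32):
    """Alternate between next-token prediction (text) and flow-matching
    denoising (one video frame at a time) until the LM emits a stop token."""
    context = list(prompt_ids)      # shared token/frame context seen by both towers
    text, frames = [], []
    for _ in range(max_steps):
        # 1) The language tower proposes the next text token given the joint context.
        token = lm_tower.sample_next_token(context)
        if token == EOT:
            break
        if token == BOV:
            # 2) The video tower generates the next frame by integrating the
            #    learned velocity field from noise (t=0) to data (t=1).
            x = torch.randn(video_tower.frame_shape)          # x_0 ~ N(0, I)
            dt = 1.0 / num_flow_steps
            for i in range(num_flow_steps):
                t = i * dt
                v = video_tower.velocity(x, t, context)       # predicted dx/dt
                x = x + dt * v                                 # Euler ODE step
            frames.append(x)
            context.append(video_tower.encode_frame(x))        # feed the frame back
        else:
            text.append(token)
            context.append(token)
    return text, frames
```

Because the text tokens live in the same context as the frames, a user can intervene mid-generation simply by appending new text tokens to `context` before the next iteration, which is one way to realize the fine-grained controllability described above.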