VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
April 14, 2026
Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
cs.AI
Abstract
Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity.
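The scaling behavior described above can be made concrete with a small sketch. The function below counts tokens for a 3D grid tokenizer under assumed compression factors (`ct`, `cs` are illustrative choices, not values from the paper): the count grows with video volume regardless of how simple the content is.

```python
def grid_token_count(frames: int, height: int, width: int,
                     ct: int = 4, cs: int = 8) -> int:
    """Token count for a spatiotemporal 3D grid tokenizer.

    ct/cs are illustrative temporal/spatial compression factors
    (real tokenizers differ). Each local spatiotemporal patch gets
    its own token, so the count scales with video volume no matter
    how simple the content is.
    """
    return (frames // ct) * (height // cs) * (width // cs)

# An 81-frame 256x256 clip under these assumed factors:
print(grid_token_count(81, 256, 256))  # 20 * 32 * 32 = 20480
```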
We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. A generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count to downstream needs and encoding longer videos than the baselines within the same token budget.
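The key property of a coarse-to-fine ordering is that any prefix of the sequence is itself a usable, lower-fidelity encoding. A minimal sketch of this idea (hypothetical shapes and token count, not from the paper):

```python
import numpy as np

def truncate_prefix(tokens: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k tokens of a coarse-to-fine 1D sequence.

    Because leading tokens capture abstract content (semantics,
    motion) and later ones add fine detail, any prefix is itself a
    valid lower-fidelity encoding that a generative decoder can
    still render into a plausible video.
    """
    return tokens[:k]

# Hypothetical example: a clip encoded as 672 tokens of width 16.
video_tokens = np.random.randn(672, 16)
coarse = truncate_prefix(video_tokens, 64)    # rough semantics + motion
full = truncate_prefix(video_tokens, 672)     # all fine-grained detail
```

A downstream generator can therefore trade fidelity for sequence length simply by choosing `k`, with no retokenization needed.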
We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.