

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

March 12, 2026
作者: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
cs.AI

Abstract

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce EVATok, a framework to produce Efficient Video Adaptive Tokenizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
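The core idea — splitting a fixed token budget across temporal blocks in proportion to each block's complexity, rather than uniformly — can be sketched as follows. The function name, the motion-score input, and the proportional-allocation heuristic are illustrative assumptions for this sketch, not EVATok's actual router or its learned assignment policy.

```python
def route_token_budget(block_motion_scores, total_tokens, min_tokens=4):
    """Split a total token budget across temporal blocks in proportion
    to a per-block complexity score, with a per-block floor so even
    static blocks keep a minimal representation."""
    n = len(block_motion_scores)
    floor = min_tokens * n
    assert total_tokens >= floor, "budget too small for the per-block floor"
    spare = total_tokens - floor
    total_score = sum(block_motion_scores) or 1.0
    budgets = [min_tokens + int(spare * s / total_score)
               for s in block_motion_scores]
    # Give any integer-rounding remainder to the most complex block.
    budgets[max(range(n), key=lambda i: block_motion_scores[i])] += (
        total_tokens - sum(budgets))
    return budgets

# A static intro, a high-motion middle, and a near-static ending:
print(route_token_budget([0.1, 0.8, 0.1], total_tokens=64))
```

A learned router, as the paper describes, would replace the hand-crafted motion score with a lightweight network that predicts the assignment directly from the video, but the quality-cost trade-off it optimizes has this same shape: more tokens to dynamic blocks, fewer to simple or repetitive ones.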