A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

April 6, 2026
Authors: Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen
cs.AI

Abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
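
To make the two quantitative ideas in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a 16x16 patch size (the abstract does not state one; this value happens to give a 32x32 grid, i.e. 1,024 tokens, for a 512x512 frame) and uses mean squared error as a stand-in for whatever loss the paper actually uses. The function names `tokens_per_frame` and `best_of_k_loss` are hypothetical.

```python
def tokens_per_frame(height, width, patch):
    """Count patch tokens for one frame on a (height/patch) x (width/patch) grid."""
    return (height // patch) * (width // patch)


def best_of_k_loss(hypotheses, target):
    """Winner-take-all selection: score every hypothesis, keep only the closest.

    hypotheses: list of K vectors (lists of floats), each one predicted delta token.
    target:     one vector, the ground-truth delta token.
    Returns (index of the best hypothesis, its mean squared error).
    MSE is an assumption made for illustration only.
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    errors = [mse(h, target) for h in hypotheses]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# Assumed patch size of 16: a 512x512 frame becomes a 32x32 grid of tokens.
grid_tokens = tokens_per_frame(512, 512, patch=16)

# One delta token per frame replaces the whole grid.
reduction = grid_tokens // 1

# Three toy hypotheses for a 2-dimensional delta token; only the closest
# one would receive a training signal under best-of-K supervision.
best_index, best_error = best_of_k_loss(
    [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]],
    [1.0, 1.5],
)
```

Under these assumptions `reduction` comes out to 1,024, matching the figure quoted in the abstract, and the second hypothesis is selected because it lies closest to the target. In a real training loop the gradient would flow only through that selected hypothesis.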
April 10, 2026