A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
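The token-reduction figure and the "only the best is supervised" training signal can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: a ViT-style VFM with a patch size of 16 (which would give 32 x 32 = 1,024 patch tokens for a 512x512 frame, versus one delta token), and mean squared error as the per-hypothesis score. The function names `tokens_per_frame` and `best_of_k_loss` are hypothetical and are not taken from the DeltaTok code.

```python
def tokens_per_frame(height, width, patch):
    """Number of patch tokens a ViT-style encoder emits for one frame."""
    return (height // patch) * (width // patch)


def best_of_k_loss(hypotheses, target):
    """Winner-takes-all loss: score every hypothesis, keep only the best.

    hypotheses: list of K predicted vectors (lists of floats).
    target:     the single ground-truth vector.
    Returns (index of the best hypothesis, its mean squared error).
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(b)

    errors = [mse(h, target) for h in hypotheses]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# Assumed patch size of 16 (not stated in the abstract).
patch_tokens = tokens_per_frame(512, 512, 16)  # 32 * 32 patch tokens
delta_tokens = 1                               # one delta token per frame
reduction = patch_tokens // delta_tokens

# Three toy hypotheses for one future delta token; only the closest one
# would receive a training signal under best-of-K supervision.
best_index, best_error = best_of_k_loss(
    [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]],
    [1.0, 2.0],
)
```

Because each frame contributes a single token, scoring many hypotheses in parallel stays cheap, which is the property the abstract credits for making multi-hypothesis training tractable.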
