Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next^1, next^2, next^3 chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.
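To make the supervision pattern and the inference claim concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the auxiliary per-depth denoising losses are averaged and added to the main loss with a scalar weight (the abstract does not state how the losses are combined), and it counts only sequential denoising passes, ignoring the extra cost of the MCP modules. The names `mcp_targets`, `mcp_loss`, `sequential_passes`, and `aux_weight` are hypothetical.

```python
from math import ceil


def mcp_targets(num_chunks, depth):
    """For each position i, list the chunk indices supervised there.

    Index i is the current chunk (main model); i+1 .. i+depth are the
    next^1 .. next^depth chunks (auxiliary MCP modules). Targets that fall
    past the end of the sequence are dropped.
    """
    return [
        [i + k for k in range(depth + 1) if i + k < num_chunks]
        for i in range(num_chunks)
    ]


def mcp_loss(main_loss, aux_losses, aux_weight=0.5):
    """Combine the main denoising loss with the per-depth auxiliary losses.

    ASSUMPTION: a weighted mean of the auxiliary losses is added to the main
    loss. The actual weighting used by Next Forcing is not given in the text.
    """
    if not aux_losses:
        return main_loss
    return main_loss + aux_weight * sum(aux_losses) / len(aux_losses)


def sequential_passes(num_chunks, denoise_steps, chunks_per_pass=1):
    """Count sequential denoising passes needed to generate num_chunks.

    ASSUMPTION: an idealized cost model in which each pass denoises
    chunks_per_pass chunks in parallel at the cost of one pass.
    """
    return ceil(num_chunks / chunks_per_pass) * denoise_steps


if __name__ == "__main__":
    # Supervision targets for 5 chunks with depth 3 (next^1, next^2, next^3).
    for position, targets in enumerate(mcp_targets(5, 3)):
        print(position, targets)

    # Toy loss values, for illustration only.
    print(mcp_loss(1.0, [2.0, 3.0, 4.0]))

    # Baseline vs. predicting the current and next chunk together.
    baseline = sequential_passes(8, 10)
    parallel = sequential_passes(8, 10, chunks_per_pass=2)
    print(baseline, parallel, baseline / parallel)
```

Under this idealized cost model, generating two chunks per pass halves the number of sequential passes, which is consistent with the 2x inference acceleration reported above. Any real speedup would also depend on the cost of the retained MCP module.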