Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches converge slowly during training and reach limited final accuracy, particularly at high frame rates, because training supervision is confined to the current chunk and carries no explicit signal about future dynamics. They are also slow at inference because of iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules, which simultaneously denoise video chunks at multiple future temporal horizons (the next^1, next^2, and next^3 chunks, i.e., one, two, and three chunks ahead). These MCP modules form a causal chain across prediction depths: they use intermediate features fused from multiple layers of the main model to predict future dynamics, so that near-future predictions inform farther-future ones and the main model receives dense multi-scale temporal supervision. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps, converges 2.3x faster, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1% and 93.5% on Clean and Random, respectively). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving a 2x inference speedup. Next Forcing also shows significant improvements on PhyWorld, a benchmark that evaluates adherence to physical laws in video generation, and reduces FVD by more than 50% on general video pretraining.
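
To make the structure of the MCP objective concrete, the following is a minimal toy sketch in plain Python, not the authors' implementation. It rests on several assumptions that the abstract does not specify: multi-layer features are fused by simple averaging, each MCP module is a callable that returns a new hidden state and a prediction, denoising is replaced by plain regression with a mean-squared error, and auxiliary losses share a single weight `aux_weight`. All names (`fuse_features`, `mcp_chain_loss`, `make_toy_module`, `denoising_rounds`) are hypothetical. The last function only illustrates the arithmetic behind the 2x inference claim, assuming the chunk predicted in parallel is accepted as-is and adds negligible cost.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical module interface: hidden state in, (new hidden state, prediction) out.
MCPModule = Callable[[Vector], Tuple[Vector, Vector]]


def fuse_features(layer_features: Sequence[Vector]) -> Vector:
    """Fuse features from several main-model layers by averaging (assumed rule)."""
    n = len(layer_features)
    return [sum(vals) / n for vals in zip(*layer_features)]


def mse(pred: Vector, target: Vector) -> float:
    """Mean-squared error, standing in for the real denoising loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def mcp_chain_loss(
    layer_features: Sequence[Vector],
    main_pred: Vector,
    targets: Sequence[Vector],
    mcp_modules: Sequence[MCPModule],
    aux_weight: float = 0.5,
) -> float:
    """Main-chunk loss plus weighted losses for chunks 1..K steps ahead.

    targets[0] is the current chunk; targets[k] is the chunk k steps ahead.
    """
    if len(targets) != len(mcp_modules) + 1:
        raise ValueError("need one target per MCP module plus the current chunk")
    total = mse(main_pred, targets[0])
    hidden = fuse_features(layer_features)
    for depth, module in enumerate(mcp_modules, start=1):
        # Causal chain: depth k consumes the hidden state produced at depth k-1.
        hidden, pred = module(hidden)
        total += aux_weight * mse(pred, targets[depth])
    return total


def make_toy_module(scale: float) -> MCPModule:
    """Toy stand-in for a lightweight MCP module: scales the hidden state."""
    def module(hidden: Vector) -> Tuple[Vector, Vector]:
        new_hidden = [scale * h for h in hidden]
        return new_hidden, new_hidden
    return module


def denoising_rounds(num_chunks: int, chunks_per_round: int = 1) -> int:
    """Sequential denoising rounds needed when each round emits several chunks."""
    return math.ceil(num_chunks / chunks_per_round)


# Toy usage: three MCP depths (next^1, next^2, next^3).
features = [[1.0, 2.0], [3.0, 4.0]]            # fused to [2.0, 3.0]
modules = [make_toy_module(2.0) for _ in range(3)]
targets = [[2.0, 3.0], [4.0, 6.0], [8.0, 12.0], [16.0, 26.0]]
loss = mcp_chain_loss(features, [2.0, 3.0], targets, modules)
```

In this toy setup only the third-depth prediction misses its target, so `loss` reduces to the weighted error at that depth. Likewise, `denoising_rounds(8, 2)` is half of `denoising_rounds(8, 1)`, which mirrors the 2x speedup obtained when the next chunk is predicted in parallel with the current one.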