Next Forcing: Causal World Modeling via Multi-Chunk Prediction

Abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches converge slowly during training and reach limited accuracy at convergence, particularly at high frame rates, because training supervision is confined to the current chunk and carries no explicit signal about future dynamics. Inference is also slow because of iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules, which simultaneously denoise video chunks at multiple future temporal horizons (next^1, next^2, next^3 chunks). These MCP modules form a causal chain across prediction depths: each module predicts future dynamics from intermediate features fused from multiple layers of the main model, so that near-future predictions inform farther-future ones and dense multi-scale temporal supervision flows back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates. At 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3× faster convergence, and it establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2× inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and reduces FVD by over 50% on general video pretraining.
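
To make the multi-chunk supervision and the parallel-inference idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that depth 0 stands for the main model denoising the current chunk and that depth k stands for the k-th auxiliary MCP module denoising the chunk k steps ahead. The function names `mcp_targets` and `sequential_rounds` are hypothetical, and the round count is an idealized figure that ignores the cost of the auxiliary modules themselves.

```python
from math import ceil


def mcp_targets(num_chunks, depths=3):
    """Enumerate (position, depth, target_chunk) supervision triples.

    Assumption: at position t the main model (depth 0) is supervised on
    chunk t, and the k-th MCP module (depth k >= 1) on chunk t + k.
    Targets that would fall past the end of the sequence are skipped.
    """
    triples = []
    for t in range(num_chunks):
        for k in range(depths + 1):
            target = t + k
            if target < num_chunks:
                triples.append((t, k, target))
    return triples


def sequential_rounds(num_chunks, chunks_per_round=1):
    """Idealized number of sequential denoising rounds at inference.

    chunks_per_round=1 models plain autoregressive rollout; 2 models
    producing the next chunk in parallel with the current one.
    """
    return ceil(num_chunks / chunks_per_round)


if __name__ == "__main__":
    # Current-chunk-only supervision vs. supervision with three MCP depths.
    main_only = mcp_targets(4, depths=0)
    with_mcp = mcp_targets(4, depths=3)
    print(len(main_only), len(with_mcp))

    # Sequential rounds for an 8-chunk rollout, without and with
    # parallel next-chunk prediction.
    print(sequential_rounds(8, 1), sequential_rounds(8, 2))
```

Under these assumptions, a 4-chunk sequence yields 4 supervision targets with the main model alone and 10 with three MCP depths, which is one way to read the claim of denser temporal supervision. Likewise, emitting two chunks per round halves the number of sequential rounds, consistent with the reported 2× acceleration only to the extent that the auxiliary module's overhead is small.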