World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Abstract

World models and multimodal large language models (MLLMs) offer complementary capabilities for predicting future outcomes from static visual observations. World models generate concrete visual rollouts of possible futures, while MLLMs reason abstractly over questions, goals, and rules. Generated rollouts are stochastic, however, and may be visually plausible yet incorrect for the task. A model therefore has to determine when visual simulation is useful, whether a given rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, in which a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context for evaluating on-policy concrete-reasoning trajectories; the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% on VRQABench and 10.9% on OpenWorldQA, while improving robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.
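To make the invoke, verify, and integrate steps concrete, the following is a minimal illustrative sketch of such a control loop. It is not the PF-OPSD implementation: every function name, the callable interfaces, and the fixed credibility threshold are assumptions introduced here for illustration, and in the setting described above these decisions are learned by the model rather than hard-coded.

```python
from typing import Callable


def answer_with_controlled_rollout(
    question: str,
    abstract_answer: Callable[[str], str],
    should_simulate: Callable[[str], bool],
    simulate: Callable[[str], str],
    credibility: Callable[[str, str], float],
    integrate: Callable[[str, str, str], str],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> str:
    """Hypothetical control loop: invoke, verify, then integrate a rollout."""
    # Abstract reasoning always yields a fallback answer.
    fallback = abstract_answer(question)

    # Invoke: skip visual simulation when it is judged unhelpful.
    if not should_simulate(question):
        return fallback

    # A stochastic world model would produce the rollout; here it is a stub.
    rollout = simulate(question)

    # Verify: discard rollouts judged not credible for this question.
    if credibility(question, rollout) < threshold:
        return fallback

    # Integrate: combine the abstract answer with the trusted rollout.
    return integrate(question, fallback, rollout)


def demo(score: float) -> str:
    """Toy stand-ins for the MLLM and the world model (all hypothetical)."""
    return answer_with_controlled_rollout(
        "Will the ball fall off the table?",
        abstract_answer=lambda q: "no",
        should_simulate=lambda q: "ball" in q,
        simulate=lambda q: "ball rolls past the edge",
        credibility=lambda q, r: score,
        integrate=lambda q, prior, r: "yes" if "past the edge" in r else prior,
    )
```

With these toy stubs, a rollout scored as credible changes the answer, while a rollout scored below the threshold is ignored and the abstract answer is kept, which mirrors the robustness to noisy or conflicting rollouts described above.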