World Models and Language Models: On the Complementarity of Concrete and Abstract Reasoning

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible yet task-incorrect, so a model must determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, in which a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while also improving robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.
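The invoke, verify, and integrate decisions described above can be illustrated with a minimal sketch. This is not the PF-OPSD implementation: the `Rollout` type, the `controlled_concrete_reasoning` function, and both thresholds are hypothetical names and values introduced here for illustration, and the fixed thresholds merely stand in for decisions that the model is described as learning.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rollout:
    """Hypothetical stand-in for one generated visual future."""
    predicted_answer: str  # answer implied by the simulated future
    credibility: float     # assumed verifier score in [0, 1]


def controlled_concrete_reasoning(
    abstract_answer: str,
    abstract_confidence: float,
    simulate: Callable[[], Rollout],
    invoke_threshold: float = 0.8,  # illustrative value, not from the paper
    trust_threshold: float = 0.5,   # illustrative value, not from the paper
) -> Tuple[str, str]:
    """Return (final_answer, decision), where decision records what happened."""
    # Invoke: skip simulation when abstract reasoning is already confident.
    if abstract_confidence >= invoke_threshold:
        return abstract_answer, "skipped"

    rollout = simulate()

    # Verify: a visually plausible rollout may still be task-incorrect,
    # so fall back to the abstract answer when credibility is low.
    if rollout.credibility < trust_threshold:
        return abstract_answer, "rejected"

    # Integrate: let a credible rollout determine the final answer.
    return rollout.predicted_answer, "used"


if __name__ == "__main__":
    answer, decision = controlled_concrete_reasoning(
        abstract_answer="right",
        abstract_confidence=0.4,
        simulate=lambda: Rollout(predicted_answer="left", credibility=0.9),
    )
    print(answer, decision)
```

Note that the simulator is called only when the abstract answer is uncertain, and its output is never trusted unconditionally. In the setting described above, these two gates would be learned behaviors of the student rather than hand-set constants.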