World Models Meet Language Models: The Complementarity of Concrete and Abstract Reasoning

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible yet incorrect for the task, so a model must determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, in which a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and datasets are available at https://github.com/yczhou001/PF-OPSD.
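To make the training asymmetry concrete, the following is a minimal toy sketch of on-policy self-distillation with teacher-side privileged context. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so a per-token reverse KL is assumed here, and the names `pf_opsd_loss`, `reverse_kl`, and `toy_policy`, as well as the string tokens, are hypothetical. The same policy plays both roles; only the teacher's context includes the ground-truth future, and the trajectory is assumed to have been sampled from the student.

```python
import math
from typing import Callable, Dict, Sequence

Dist = Dict[str, float]  # token -> probability
# A policy maps (context, prefix of the trajectory) to a next-token distribution.
Policy = Callable[[Sequence[str], Sequence[str]], Dist]


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    """KL(student || teacher); eps guards against log(0)."""
    return sum(
        p * (math.log(p + eps) - math.log(teacher.get(tok, 0.0) + eps))
        for tok, p in student.items()
        if p > 0.0
    )


def pf_opsd_loss(
    policy: Policy,
    student_ctx: Sequence[str],
    privileged_ctx: Sequence[str],
    trajectory: Sequence[str],
) -> float:
    """Average per-token divergence along a student-sampled trajectory.

    Assumption: the teacher is the same policy conditioned on extra
    privileged context (true future and answer). In real training the
    teacher side would be treated as a fixed target (no gradient).
    """
    teacher_ctx = list(student_ctx) + list(privileged_ctx)
    total = 0.0
    for t in range(len(trajectory)):
        prefix = trajectory[:t]
        total += reverse_kl(policy(student_ctx, prefix), policy(teacher_ctx, prefix))
    return total / max(len(trajectory), 1)


def toy_policy(context: Sequence[str], prefix: Sequence[str]) -> Dist:
    """Hypothetical policy: more confident once it has seen the true future."""
    if "future:ball_falls" in context:
        return {"falls": 0.9, "stays": 0.1}
    return {"falls": 0.5, "stays": 0.5}


loss_with_future = pf_opsd_loss(
    toy_policy, ["image", "question"], ["future:ball_falls"], ["falls"]
)
loss_without_future = pf_opsd_loss(toy_policy, ["image", "question"], [], ["falls"])
```

In this toy setting the loss is positive when the privileged future changes the teacher's distribution and zero when no privileged context is supplied. At test time only `student_ctx` would be available, which mirrors the claim that the deployable student never observes true futures.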