World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible yet incorrect for the task, so a model must determine when visual simulation is useful, whether a given rollout is credible, and how the rollout should influence the final answer. We formulate this problem as controlled concrete reasoning, in which a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context for evaluating on-policy concrete-reasoning trajectories; the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, and improves robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.
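As an illustration only, the invoke, verify, and integrate decisions named in the abstract can be sketched as a small control loop. This is not the released PF-OPSD code. Every name below (`controlled_concrete_reasoning`, `should_simulate`, `simulate`, `is_credible`, `answer_abstract`, `answer_with_rollout`) is a hypothetical placeholder, a plain string stands in for both the visual observation and the rollout, and the fallback to abstract reasoning when a rollout is rejected is an assumed design choice, not something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    answer: str
    used_rollout: bool  # True only if a rollout was invoked AND accepted


def controlled_concrete_reasoning(
    query: str,
    should_simulate: Callable[[str], bool],           # hypothetical "invoke" policy
    simulate: Callable[[str], str],                   # hypothetical world-model call
    is_credible: Callable[[str, str], bool],          # hypothetical "verify" check
    answer_abstract: Callable[[str], str],            # answer without any rollout
    answer_with_rollout: Callable[[str, str], str],   # hypothetical "integrate" step
) -> Decision:
    # Invoke: skip simulation when it is not expected to help.
    if not should_simulate(query):
        return Decision(answer_abstract(query), used_rollout=False)

    rollout = simulate(query)

    # Verify: a plausible-looking rollout may still be wrong for the task,
    # so fall back to abstract reasoning when it is rejected (assumed policy).
    if not is_credible(query, rollout):
        return Decision(answer_abstract(query), used_rollout=False)

    # Integrate: let the accepted rollout inform the final answer.
    return Decision(answer_with_rollout(query, rollout), used_rollout=True)


# Toy stand-ins so the sketch runs end to end; none of these are real models.
def _toy_run(rollout_text: str) -> Decision:
    return controlled_concrete_reasoning(
        query="Will the ball fall off the table?",
        should_simulate=lambda q: "fall" in q,
        simulate=lambda q: rollout_text,
        is_credible=lambda q, r: "ball" in r,
        answer_abstract=lambda q: "unsure",
        answer_with_rollout=lambda q, r: "yes" if "falls" in r else "no",
    )


accepted = _toy_run("the ball rolls and falls")
rejected = _toy_run("a cat walks by")  # off-task rollout, fails verification
```

In this toy run, `accepted` uses the rollout and `rejected` falls back to the abstract answer, which mirrors the robustness to noisy or conflicting rollouts that the abstract describes only at a high level.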