Planning with Reasoning using Vision Language World Model
September 2, 2025
Authors: Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung
cs.AI
Abstract
Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. Those targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented by a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1. The VLWM models also outperform strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
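
The abstract describes reflective system-2 planning as cost minimization over VLWM roll-outs scored by a critic. The sketch below is only an illustrative reading of that loop under assumed interfaces, not the authors' implementation: `propose_actions`, `rollout_state`, and `critic_cost` are hypothetical placeholders standing in for the VLWM action policy, its dynamics model, and the self-supervised critic.

```python
# Minimal sketch (not the authors' code) of system-2 planning via cost
# minimization: propose candidate action trajectories, roll out hypothetical
# future states with the world model, score each end state against the goal
# with a critic, and return the lowest-cost plan.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    actions: List[str]      # interleaved action descriptions
    final_state: str        # predicted world state after the roll-out
    cost: float             # critic-estimated semantic distance to the goal state


def system2_plan(
    observation: str,
    goal_state: str,
    propose_actions: Callable[[str, str], List[List[str]]],  # hypothetical VLWM policy interface
    rollout_state: Callable[[str, List[str]], str],          # hypothetical VLWM dynamics interface
    critic_cost: Callable[[str, str], float],                # hypothetical critic interface
    num_candidates: int = 8,
) -> Candidate:
    """Return the candidate plan whose predicted end state the critic
    judges closest to the expected goal state."""
    candidates: List[Candidate] = []
    for actions in propose_actions(observation, goal_state)[:num_candidates]:
        final_state = rollout_state(observation, actions)
        candidates.append(Candidate(actions, final_state, critic_cost(final_state, goal_state)))
    # Reflective system-2 choice: minimize the critic's cost across roll-outs.
    return min(candidates, key=lambda c: c.cost)
```

By contrast, system-1 decoding would correspond to taking the policy's single greedy trajectory directly, without the roll-out-and-score loop above.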