

Planning with Reasoning using Vision Language World Model

September 2, 2025
Authors: Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, Pascale Fung
cs.AI

Abstract

Effective planning requires strong world models, but high-level world models that can understand and reason about actions with semantic and temporal abstraction remain largely underdeveloped. We introduce the Vision Language World Model (VLWM), a foundation model trained for language-based world modeling on natural videos. Given visual observations, the VLWM first infers the overall goal achievements and then predicts a trajectory composed of interleaved actions and world state changes. These targets are extracted by iterative LLM Self-Refine conditioned on compressed future observations represented as a Tree of Captions. The VLWM learns both an action policy and a dynamics model, which respectively facilitate reactive system-1 plan decoding and reflective system-2 planning via cost minimization. The cost evaluates the semantic distance between the hypothetical future states given by VLWM roll-outs and the expected goal state, and is measured by a critic model that we trained in a self-supervised manner. The VLWM achieves state-of-the-art Visual Planning for Assistance (VPA) performance on both benchmark evaluations and our proposed PlannerArena human evaluations, where system-2 improves the Elo score by +27% over system-1. The VLWM also outperforms strong VLM baselines on the RoboVQA and WorldPrediction benchmarks.
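To make the system-2 procedure described in the abstract concrete, the sketch below shows one way a roll-out-plus-critic planning loop could be wired up: sample several candidate action/state trajectories from the world model, score each predicted end state against the goal with a critic, and keep the lowest-cost plan. All names and interfaces here (VLWM.rollout, Critic.cost, the Rollout structure) are hypothetical placeholders for illustration, not the paper's actual API or implementation.

```python
# Minimal sketch of reflective system-2 planning via cost minimization.
# All interfaces below are assumed placeholders, not the paper's code.
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    actions: List[str]   # interleaved action descriptions
    final_state: str     # predicted world state after the last action


class VLWM:
    def rollout(self, observation: str, goal: str, num_candidates: int) -> List[Rollout]:
        """Sample candidate trajectories of actions and state changes (system-1 decoding)."""
        raise NotImplementedError


class Critic:
    def cost(self, hypothetical_state: str, goal_state: str) -> float:
        """Semantic distance between a predicted future state and the expected goal state."""
        raise NotImplementedError


def system2_plan(vlwm: VLWM, critic: Critic, observation: str, goal: str,
                 num_candidates: int = 8) -> Rollout:
    """Roll out several candidate plans and return the one whose predicted
    end state the critic judges closest to the goal (minimum cost)."""
    candidates = vlwm.rollout(observation, goal, num_candidates)
    return min(candidates, key=lambda r: critic.cost(r.final_state, goal))
```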