Visual Planning: Let's Think Only with Images

May 16, 2025
Authors: Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
cs.AI

Abstract

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
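The abstract names GRPO as the post-training method but gives no details. As a rough illustration of the group-relative idea behind GRPO in general, the sketch below normalizes each sampled rollout's reward against the mean and standard deviation of its own sampling group. The function name, the use of the population standard deviation, the `eps` term, and the example reward values are all assumptions made for illustration; none of them reflect the reward design or implementation used in VPRL.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Hypothetical helper: subtracts the group mean and divides by the
    group standard deviation (plus eps to avoid division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four candidate plans sampled for the same start state,
# each given a made-up scalar reward (1.0 = desirable, 0.0 = undesirable).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above their group average receive a positive advantage and those below receive a negative one, so no separately learned value function is needed to provide a baseline.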
