Visual Planning: Let's Think Only with Images

Abstract

Recent advances in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models rely predominantly on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometric information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), which uses GRPO to post-train large vision models and leads to substantial improvements in planning on representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
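
As a rough illustration of the GRPO ingredient named above, the sketch below shows the group-relative advantage computation that GRPO is generally described as using: several candidate outputs are sampled for the same input, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is not the VPRL implementation. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the reward values are all assumptions made for illustration; the abstract does not specify the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar score per candidate sampled for the
    same input. `eps` (an assumed detail) avoids division by zero
    when every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four candidate next-state images sampled
# from the same current state (made-up values, not from the paper).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's mean receive positive advantages and those below receive negative ones, so no separately learned value baseline is needed in this formulation.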