

AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

February 20, 2025
作者: Alan Dao, Dinh Bach Vu
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine-Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeek-R1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
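
To make the GRPO stage more concrete, below is a minimal Python sketch of the kind of reward function such a setup might use to score a model's predicted move sequence against a maze. The move tokens, maze encoding, penalty weights, and the `maze_reward` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of a maze-navigation reward suitable
# for GRPO-style fine-tuning. All encodings and weights here are assumptions.

from typing import List, Set, Tuple

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def maze_reward(
    moves: List[str],
    walls: Set[Tuple[int, int]],
    start: Tuple[int, int],
    goal: Tuple[int, int],
    grid_size: int = 5,
) -> float:
    """Score a predicted move sequence: +1 for reaching the goal,
    small penalties for malformed tokens, collisions, and path length."""
    pos = start
    reward = 0.0
    for token in moves:
        if token not in MOVES:          # malformed output token
            return -1.0
        dr, dc = MOVES[token]
        nxt = (pos[0] + dr, pos[1] + dc)
        out_of_bounds = not (0 <= nxt[0] < grid_size and 0 <= nxt[1] < grid_size)
        if out_of_bounds or nxt in walls:
            reward -= 0.2               # penalize hitting walls or leaving the grid
        else:
            pos = nxt
        reward -= 0.01                  # small step cost to favor short paths
    if pos == goal:
        reward += 1.0                   # success bonus
    return reward

# Example: a 5x5 maze with one blocked cell; the proposed path reaches the goal.
if __name__ == "__main__":
    walls = {(1, 1)}
    print(maze_reward(["down", "down", "right", "right"], walls, (0, 0), (2, 2)))
```

In a GRPO setup of this kind, several candidate move sequences would be sampled per maze, scored with a reward like the one above, and each candidate's advantage computed relative to the group mean before updating the policy.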
