SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
December 2, 2025
Authors: Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye
cs.AI
Abstract
Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, we propose a novel VLN agent framework named SeeNav-Agent. First, to reduce perception hallucinations in the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which also improves the agent's understanding of the current spatial state. Second, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task and then perform efficient step-level advantage estimation by randomly grouping navigation steps. SRGPO provides dense reward signals for the reinforcement learning of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark show that, with the zero-shot VP module, GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training with SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM by 5.6 pp. Moreover, compared with RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.
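To make the SRGPO description above concrete, the following is a minimal sketch of step-level advantage estimation by randomly grouping navigation steps and normalizing each step's verifiable process reward against its group, in the spirit of GRPO's group-relative baseline but applied per step rather than per trajectory. The function name, group size, and normalization details are illustrative assumptions, not the paper's actual implementation.

```python
import random
import numpy as np

def step_group_advantages(step_rewards, group_size=8, seed=0):
    """Illustrative step-level advantage estimation via random grouping.

    step_rewards: verifiable process rewards, one per collected navigation
    step (pooled across trajectories). Steps are shuffled into groups of
    `group_size`; each step's advantage is its reward normalized by the
    mean and std of its group (a group-relative baseline, no critic).
    """
    rng = random.Random(seed)
    indices = list(range(len(step_rewards)))
    rng.shuffle(indices)  # random grouping of navigation steps

    advantages = np.zeros(len(step_rewards), dtype=np.float64)
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        rewards = np.array([step_rewards[i] for i in group], dtype=np.float64)
        baseline = rewards.mean()
        scale = rewards.std() + 1e-8  # avoid division by zero
        for i, r in zip(group, rewards):
            advantages[i] = (r - baseline) / scale
    return advantages

# Example: dense per-step rewards pooled from two short episodes.
print(step_group_advantages([0.0, 0.5, 1.0, 0.2, 0.0, 0.8], group_size=3))
```

Under this sketch, every collected step receives its own dense advantage signal, which is what allows the policy update to credit individual navigation decisions rather than only whole trajectories.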