SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
December 2, 2025
Authors: Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye
cs.AI
Abstract
Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations in the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which also improves the agent's understanding of the current spatial state. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping different navigation steps. SRGPO provides dense reward signals for the reinforcement learning process of the VLN agent and enhances its planning capability. Experimental results on the EmbodiedBench Navigation benchmark indicate that, by introducing the zero-shot VP module, GPT-4.1 achieves a navigation success rate of 86.7%, surpassing the current best LVLM by approximately 20 percentage points (pp). Through post-training with SRGPO, the Qwen2.5-VL-3B model reaches a navigation success rate of 72.3%, outperforming the best existing LVLM by 5.6 pp. Moreover, compared to RFT algorithms such as GRPO and GiGPO, the proposed SRGPO demonstrates significant improvements in training stability, convergence efficiency, and generalization capability.
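The abstract describes SRGPO only at a high level. As an illustration of what "step-level advantage estimation by randomly grouping different navigation steps" could look like, the minimal Python sketch below pools per-step process rewards from many rollouts, shuffles them into fixed-size groups, and normalizes each step's reward against its group statistics. The function name, the group size, and the binary process reward in the usage example are illustrative assumptions, not the paper's actual implementation.

```python
import random
import statistics

# Hypothetical sketch (not the paper's code): step-level advantage estimation
# via random grouping. Each navigation step carries a verifiable process reward;
# steps pooled across rollouts are shuffled into fixed-size groups, and each
# step's advantage is its reward normalized by its group's mean and std.

def step_level_advantages(step_rewards, group_size=8, seed=0, eps=1e-6):
    """Assign a group-normalized advantage to every step reward.

    step_rewards: list of floats, one verifiable process reward per step,
                  pooled across all rollouts in the batch.
    Returns a list of advantages aligned with step_rewards.
    """
    rng = random.Random(seed)
    indices = list(range(len(step_rewards)))
    rng.shuffle(indices)  # random grouping of steps across trajectories

    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        rewards = [step_rewards[i] for i in group]
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        for i in group:
            advantages[i] = (step_rewards[i] - mean) / (std + eps)
    return advantages

if __name__ == "__main__":
    # Toy process rewards: 1.0 when a step moves the agent closer to the goal.
    rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
    print(step_level_advantages(rewards, group_size=4))
```

Compared with trajectory-level schemes such as GRPO, normalizing rewards within groups of individual steps yields a dense, per-step advantage signal, which is consistent with the dense-reward motivation stated in the abstract.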