VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

January 13, 2026
Authors: Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li, Chang Liu, Sheng Chen, Yangang Zhang, Junzhi Yu
cs.AI

Abstract

Vision-Language-Action (VLA) models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large vision-language models (VLMs). However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in language-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought (CoT) mechanism that dynamically triggers explicit reasoning only when necessary, enabling the agent to switch fluidly between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that builds a persistent, cross-modal semantic memory, enabling the agent to recall past observations to avoid repetitive exploration and to infer movement trends in dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied-navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm able to adjust both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers zero-shot to real-world robotic platforms, executing diverse navigation tasks and demonstrating strong cross-domain and cross-task generalization.
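
The abstract describes these mechanisms only at a high level, so the following is a minimal, hypothetical Python sketch of the dual-process control loop it outlines: an adaptive trigger that invokes explicit chain-of-thought reasoning only when the fast policy's confidence is low, backed by a toy linguistic memory that stores captioned observations for later recall. Every name here (LinguisticMemory, fast_policy, slow_reason, the action_confidence field, the 0.5 threshold) is an illustrative assumption, not VLingNav's actual interface.

```python
# Hypothetical sketch of an adaptive-CoT navigation loop; not the paper's code.
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    step: int
    caption: str             # linguistic summary of the observation
    embedding: list          # visual feature kept to disambiguate similar captions


@dataclass
class LinguisticMemory:
    entries: list = field(default_factory=list)

    def add(self, step: int, caption: str, embedding: list) -> None:
        self.entries.append(MemoryEntry(step, caption, embedding))

    def recall(self, query: str) -> list:
        # Naive keyword recall; the paper presumably uses learned cross-modal retrieval.
        return [e for e in self.entries if query.lower() in e.caption.lower()]


def fast_policy(observation: dict) -> dict:
    # Fast path: reactive action without explicit reasoning.
    return {"action": "move_forward"}


def slow_reason(observation: dict, memory: LinguisticMemory, instruction: str) -> dict:
    # Slow path: explicit reasoning that consults memory to avoid re-exploration.
    seen_before = memory.recall(instruction)
    thought = f"Goal '{instruction}'; {len(seen_before)} relevant past observations."
    return {"action": "replan", "thought": thought}


def step(observation: dict, instruction: str, memory: LinguisticMemory,
         step_idx: int, threshold: float = 0.5) -> dict:
    memory.add(step_idx, observation["caption"], observation["embedding"])
    confidence = observation.get("action_confidence", 1.0)
    if confidence < threshold:        # trigger explicit CoT only when needed
        return slow_reason(observation, memory, instruction)
    return fast_policy(observation)


if __name__ == "__main__":
    memory = LinguisticMemory()
    obs = {"caption": "a hallway with a red door", "embedding": [0.1, 0.2],
           "action_confidence": 0.3}
    print(step(obs, "find the red door", memory, step_idx=0))
```

In the paper's setting, the confidence signal, captions, and recall would presumably be produced by the VLA model itself rather than by hand-written heuristics; the sketch only fixes the control flow of switching between fast execution and deliberate planning.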