FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
January 20, 2026
Authors: Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi
cs.AI
Abstract
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works such as NavCoT and NavGPT-2 demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning, and multimodal extensions such as OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparsely annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, during CoT training, imagined visual tokens are encoded into a compact latent space by a pretrained Visual AutoRegressor (VAR), and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, the model performs direct instruction-to-action mapping while retaining reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and navigation efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
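The abstract describes the training recipe only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of what a unified multi-CoT training step could look like; the module and function names (LatentVisualEncoder, NavPolicy, multi_cot_training_step) and all hyperparameters are placeholders and not the paper's actual implementation. The idea shown: imagined observations are compressed into a few latent tokens by a frozen encoder standing in for the pretrained Visual AutoRegressor, and a CoT mode is sampled per step so the policy also learns the plain instruction-to-action mapping used at inference.

```python
# Hypothetical sketch of a unified multi-CoT training step (not the paper's code).
import random
import torch
import torch.nn as nn

class LatentVisualEncoder(nn.Module):
    """Stand-in for a frozen pretrained visual autoregressor that compresses
    imagined observations into a short sequence of latent tokens."""
    def __init__(self, img_dim=768, latent_len=4, d_model=512):
        super().__init__()
        self.proj = nn.Linear(img_dim, d_model)
        self.pool = nn.AdaptiveAvgPool1d(latent_len)

    def forward(self, imagined_obs):             # (B, N_patches, img_dim)
        x = self.proj(imagined_obs)               # (B, N_patches, d_model)
        x = self.pool(x.transpose(1, 2))          # (B, d_model, latent_len)
        return x.transpose(1, 2)                  # (B, latent_len, d_model)

class NavPolicy(nn.Module):
    """Toy policy: fuses instruction, current-observation, and optional CoT
    tokens, then predicts a discrete navigation action."""
    def __init__(self, d_model=512, n_actions=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, token_seq):                 # (B, L, d_model)
        fused = self.fuse(token_seq)
        return self.head(fused.mean(dim=1))       # (B, n_actions)

def multi_cot_training_step(policy, latent_enc, batch, optimizer):
    """One step of an assumed unified multi-CoT objective: a reasoning mode is
    sampled per step so the policy sees textual, visual, and multimodal CoT
    supervision, yet also learns to act with no CoT tokens at all."""
    instr, obs, text_cot, imagined_obs, actions = batch   # pre-embedded tensors
    mode = random.choice(["none", "text", "visual", "multimodal"])

    parts = [instr, obs]
    if mode in ("text", "multimodal"):
        parts.append(text_cot)                      # explicit textual rationale tokens
    if mode in ("visual", "multimodal"):
        with torch.no_grad():                       # keep the visual encoder frozen
            parts.append(latent_enc(imagined_obs))  # compact latent visual CoT tokens

    logits = policy(torch.cat(parts, dim=1))
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), mode
```

Under this (assumed) setup, the "none" mode is what runs at test time: the policy maps instruction and observation tokens directly to actions, which is how the explicit CoT token overhead is avoided at inference.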