FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

January 20, 2026
Authors: Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi
cs.AI

Abstract

Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
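To make the training recipe described above more concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It illustrates, under assumed module names, dimensions, and action space, the idea the abstract describes: a frozen VAR-style encoder compresses imagined observations into a handful of latent tokens, a single policy is trained with randomly sampled reasoning modes (textual, visual, multimodal, or none), and inference uses the direct, CoT-free mode. Names such as LatentVisualEncoder, NavPolicy, and the mode-sampling scheme are illustrative assumptions based only on the abstract.

```python
# Minimal sketch (not the authors' code) of the unified multi-CoT training idea:
# imagined visual observations are compressed into a few latent tokens by a frozen
# "visual autoregressor"-style encoder, and one policy is trained on randomly mixed
# reasoning modes so that inference can skip explicit CoT generation entirely.
# Module names, dimensions, and the mode-sampling scheme are assumptions.

import random
import torch
import torch.nn as nn

class LatentVisualEncoder(nn.Module):
    """Stand-in for a pretrained VAR-style encoder: maps an imagined frame
    to a small number of compact latent tokens (kept frozen during policy training)."""
    def __init__(self, img_dim=512, latent_dim=256, num_latents=4):
        super().__init__()
        self.proj = nn.Linear(img_dim, latent_dim * num_latents)
        self.num_latents, self.latent_dim = num_latents, latent_dim
        for p in self.parameters():
            p.requires_grad = False  # pretrained and frozen

    def forward(self, imagined_frames):            # (B, img_dim)
        z = self.proj(imagined_frames)
        return z.view(-1, self.num_latents, self.latent_dim)  # (B, K, latent_dim)

class NavPolicy(nn.Module):
    """Single policy that consumes instruction + observation tokens, optionally
    augmented with textual-CoT and/or latent visual-CoT tokens during training."""
    def __init__(self, dim=256, num_actions=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, token_seq):                  # (B, T, dim)
        h = self.backbone(token_seq)
        return self.action_head(h[:, 0])           # predict the action from the first token

def build_sequence(instr, obs, text_cot, visual_cot, mode):
    """Concatenate token streams according to the sampled CoT mode."""
    parts = [instr, obs]
    if mode in ("text", "multimodal"):
        parts.append(text_cot)
    if mode in ("visual", "multimodal"):
        parts.append(visual_cot)
    return torch.cat(parts, dim=1)

if __name__ == "__main__":
    B, dim = 8, 256
    encoder, policy = LatentVisualEncoder(latent_dim=dim), NavPolicy(dim=dim)
    optim = torch.optim.AdamW(policy.parameters(), lr=1e-4)

    for step in range(3):  # toy training loop on random data
        instr = torch.randn(B, 12, dim)            # instruction tokens
        obs = torch.randn(B, 16, dim)              # current-observation tokens
        text_cot = torch.randn(B, 20, dim)         # textual reasoning tokens
        visual_cot = encoder(torch.randn(B, 512))  # imagined frames -> 4 latent tokens
        actions = torch.randint(0, 6, (B,))        # ground-truth actions

        # Unified multi-CoT strategy: each batch samples one reasoning mode,
        # including "none", which matches the direct mapping used at inference.
        mode = random.choice(["none", "text", "visual", "multimodal"])
        logits = policy(build_sequence(instr, obs, text_cot, visual_cot, mode))
        loss = nn.functional.cross_entropy(logits, actions)

        optim.zero_grad()
        loss.backward()
        optim.step()
        print(f"step {step} mode={mode:10s} loss={loss.item():.3f}")

    # Inference: direct instruction-to-action mapping, no CoT tokens generated.
    with torch.no_grad():
        logits = policy(build_sequence(instr, obs, text_cot, visual_cot, "none"))
        print("predicted actions:", logits.argmax(dim=-1).tolist())
```

Because the CoT streams only appear as extra input tokens during training, dropping them at inference leaves the sequence length (and thus latency) unchanged relative to a CoT-free baseline, which is the property the abstract attributes to implicit reasoning.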