Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

December 9, 2025
Authors: Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu
cs.AI

Abstract

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
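
The split between a slow planner and a fast controller can be illustrated with a toy control loop. The sketch below is not the authors' implementation: `slow_planner`, `fast_policy`, `run_episode`, and every parameter (`horizon`, `step`, `replan_every`) are hypothetical stand-ins. The "System 2" stand-in uses straight-line geometry instead of a VLM, and the "System 1" stand-in takes a single greedy step instead of sampling a trajectory from a diffusion policy. It only shows the scheduling idea: the planner is called rarely, while the local policy runs at every step against the most recent waypoint.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Goal:
    """Hypothetical planner output: a mid-term waypoint in world coordinates."""
    waypoint: Point


def slow_planner(position: Point, destination: Point, horizon: float = 2.0) -> Goal:
    """Toy stand-in for System 2 (assumption: straight-line geometry, no VLM).

    Returns a waypoint at most `horizon` units along the line to the destination.
    """
    dx, dy = destination[0] - position[0], destination[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= horizon:
        return Goal(destination)
    scale = horizon / dist
    return Goal((position[0] + dx * scale, position[1] + dy * scale))


def fast_policy(position: Point, goal: Goal, step: float = 0.25) -> Point:
    """Toy stand-in for System 1 (assumption: greedy step, no diffusion model).

    Moves at most `step` units toward the current waypoint.
    """
    dx, dy = goal.waypoint[0] - position[0], goal.waypoint[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return goal.waypoint
    return (position[0] + dx / dist * step, position[1] + dy / dist * step)


def run_episode(
    start: Point,
    destination: Point,
    replan_every: int = 8,
    max_steps: int = 200,
    tol: float = 1e-6,
) -> Tuple[List[Point], int]:
    """Run the fast policy every step and the slow planner every `replan_every` steps."""
    position = start
    goal = slow_planner(position, destination)
    planner_calls = 1
    path = [position]
    for t in range(1, max_steps + 1):
        if t % replan_every == 0:
            # Slow loop: refresh the waypoint from the current position.
            goal = slow_planner(position, destination)
            planner_calls += 1
        # Fast loop: always act on the latest available waypoint.
        position = fast_policy(position, goal)
        path.append(position)
        if math.hypot(destination[0] - position[0], destination[1] - position[1]) <= tol:
            break
    return path, planner_calls


if __name__ == "__main__":
    path, calls = run_episode((0.0, 0.0), (5.0, 0.0))
    print(f"fast steps: {len(path) - 1}, planner calls: {calls}")
```

In this toy setup the planner runs far less often than the controller, which mirrors the motivation stated in the abstract: the expensive reasoning step does not sit on the per-step control path.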
December 11, 2025