StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

July 7, 2025
Authors: Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
cs.AI

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency, grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.
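
To make the slow-fast idea concrete, below is a minimal bookkeeping sketch, not the StreamVLN implementation: it assumes a sliding window over recent dialogue turns (the fast context) and a bounded store of pruned tokens from evicted turns (the slow memory). The class name `SlowFastContext`, the parameters `window_size`, `memory_budget`, and `prune_keep`, and the naive truncation standing in for the paper's 3D-aware token pruning and KV-cache reuse are all illustrative assumptions.

```python
from collections import deque

class SlowFastContext:
    """Toy sketch of slow-fast context bookkeeping (names are illustrative,
    not taken from the StreamVLN codebase)."""

    def __init__(self, window_size=8, memory_budget=64, prune_keep=4):
        self.window_size = window_size      # dialogue turns kept in the fast context
        self.memory_budget = memory_budget  # max tokens kept in the slow memory
        self.prune_keep = prune_keep        # tokens retained per evicted turn
        self.fast_window = deque()          # recent (turn_id, visual_tokens) pairs
        self.slow_memory = []               # pruned tokens from older turns

    def _prune(self, visual_tokens):
        # Placeholder for the paper's 3D-aware token pruning: here we simply
        # keep the first `prune_keep` tokens of an evicted observation.
        return visual_tokens[: self.prune_keep]

    def add_turn(self, turn_id, visual_tokens):
        """Append a new observation; when the sliding window overflows, evict
        the oldest turn into the slow memory in compressed form."""
        self.fast_window.append((turn_id, visual_tokens))
        if len(self.fast_window) > self.window_size:
            _, old_tokens = self.fast_window.popleft()
            self.slow_memory.extend(self._prune(old_tokens))
            # Bound the slow memory so context size and inference cost stay fixed.
            self.slow_memory = self.slow_memory[-self.memory_budget:]

    def current_context(self):
        """Tokens visible to the model at this step: slow memory + fast window."""
        recent = [tok for _, toks in self.fast_window for tok in toks]
        return self.slow_memory + recent


if __name__ == "__main__":
    ctx = SlowFastContext(window_size=2, memory_budget=6, prune_keep=2)
    for t in range(5):
        ctx.add_turn(t, [f"v{t}_{i}" for i in range(4)])  # 4 visual tokens per turn
    print(ctx.current_context())  # bounded: 6 memory tokens + 2 recent turns
```

Under this reading, the total context handed to the model stays bounded regardless of stream length, which is the property the abstract attributes to the slow-fast design; the actual system additionally reuses the KV cache across turns rather than re-encoding the retained tokens.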