StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and to generate actions, grounded in language instructions, with low latency. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. Project page: https://streamvln.github.io/
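
To make the slow-fast idea concrete, the following is a minimal Python sketch of a bounded context that pairs a sliding window of recent dialogue turns with a compressed memory of older visual tokens. It is an illustration under stated assumptions, not the StreamVLN implementation: the names `Turn` and `SlowFastContext`, the `voxel_id` key, the per-turn eviction, and the fixed `memory_budget` are all hypothetical. The rule "keep the most recent token per voxel" stands in for the 3D-aware pruning mentioned in the abstract, whose actual procedure is not described here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One dialogue turn: visual tokens for an observation plus the chosen action."""
    step: int
    tokens: list  # list of (voxel_id, feature) pairs; voxel_id is a hypothetical 3D cell index
    action: str


@dataclass
class SlowFastContext:
    """Illustrative bounded context: fast sliding window plus slow compressed memory."""
    window_size: int = 4    # max active turns in the fast context (assumed value)
    memory_budget: int = 8  # max tokens kept in the slow memory (assumed value)
    window: deque = field(default_factory=deque)
    memory: dict = field(default_factory=dict)  # voxel_id -> (step, feature)

    def add_turn(self, turn: Turn) -> None:
        # Fast path: new turns enter the sliding window at full resolution.
        self.window.append(turn)
        # Slow path: turns that fall out of the window are compressed into memory.
        while len(self.window) > self.window_size:
            self._compress(self.window.popleft())

    def _compress(self, turn: Turn) -> None:
        # Assumed pruning rule: a newer token overwrites an older one in the same voxel.
        for voxel_id, feature in turn.tokens:
            self.memory[voxel_id] = (turn.step, feature)
        # Enforce the budget by keeping only the most recently observed voxels.
        if len(self.memory) > self.memory_budget:
            newest = sorted(self.memory.items(), key=lambda kv: kv[1][0], reverse=True)
            self.memory = dict(newest[: self.memory_budget])

    def context_size(self) -> int:
        # Total tokens a model would attend to: memory plus active window.
        return len(self.memory) + sum(len(t.tokens) for t in self.window)
```

In this sketch the context size is capped at `memory_budget` plus the tokens in `window_size` turns, regardless of how many turns have been streamed, which mirrors the bounded context size and inference cost described above. KV cache reuse is not modeled here.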
