StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency, grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, which supports robust and efficient real-world deployment. Project page: https://streamvln.github.io/
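To make the slow-fast idea concrete, the following is a minimal Python sketch of a context that keeps recent turns at full resolution in a sliding window and moves older visual tokens into a compressed, capped memory. It is an illustration under stated assumptions, not the StreamVLN implementation: the names `Turn`, `SlowFastContext`, `prune_tokens`, `window_size`, and `memory_budget` are hypothetical, uniform subsampling stands in for the 3D-aware token pruning (whose details the abstract does not give), and KV cache reuse is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One dialogue turn: the visual tokens observed and the action emitted."""
    visual_tokens: list
    action: str


def prune_tokens(tokens, keep_every=4):
    """Placeholder compression: keep every `keep_every`-th token.

    Assumption: this uniform subsampling is NOT the 3D-aware pruning from the
    abstract; it only stands in for "a function that shrinks old visual tokens".
    """
    return tokens[::keep_every]


@dataclass
class SlowFastContext:
    window_size: int = 2      # hypothetical: turns kept at full resolution
    memory_budget: int = 8    # hypothetical: max compressed tokens retained
    window: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def add_turn(self, turn):
        """Append a turn to the fast window, spilling old turns into memory."""
        self.window.append(turn)
        # When the fast window overflows, compress the oldest turn into memory.
        while len(self.window) > self.window_size:
            evicted = self.window.popleft()
            self.memory.extend(prune_tokens(evicted.visual_tokens))
        # Drop the oldest compressed tokens so total context stays bounded.
        excess = len(self.memory) - self.memory_budget
        if excess > 0:
            del self.memory[:excess]

    def context_size(self):
        """Total number of visual tokens currently held in context."""
        return len(self.memory) + sum(len(t.visual_tokens) for t in self.window)


# Simulate a 10-step stream with 8 visual tokens per step.
ctx = SlowFastContext()
sizes = []
for step in range(10):
    tokens = [f"s{step}_t{i}" for i in range(8)]
    ctx.add_turn(Turn(tokens, action="forward"))
    sizes.append(ctx.context_size())
```

In this toy setup the context size stops growing once the memory budget is reached, however long the stream runs, which is the "bounded context size" property the abstract describes. Capping memory by discarding the oldest compressed tokens is a simplification chosen for the sketch.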
