StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate low-latency actions grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is https://streamvln.github.io/.
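
The slow-fast idea described above can be illustrated with a small toy sketch. This is not the StreamVLN implementation: the class name `SlowFastContext`, its parameters (`window_size`, `keep_every`, `max_memory_tokens`), and the uniform subsampling used in `prune` are all assumptions made for illustration. In particular, the subsampling is only a stand-in for the 3D-aware token pruning named in the abstract, whose details are not given here, and KV cache reuse is not modeled. The sketch only shows how a sliding window of recent turns plus a compressed, capped memory keeps the total context size bounded as the stream grows.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class SlowFastContext:
    """Toy slow-fast context: full recent turns plus a pruned, capped memory."""

    window_size: int = 4          # hypothetical: recent dialogue turns kept in full
    keep_every: int = 4           # hypothetical: stand-in pruning stride
    max_memory_tokens: int = 64   # hypothetical: cap on the slow memory (must be > 0)
    fast: Deque[List[int]] = field(default_factory=deque)
    slow: List[int] = field(default_factory=list)

    def prune(self, tokens: List[int]) -> List[int]:
        # Placeholder: uniform subsampling, NOT the 3D-aware pruning strategy.
        return tokens[:: self.keep_every]

    def add_turn(self, tokens: List[int]) -> None:
        # New turns enter the fast window at full resolution.
        self.fast.append(list(tokens))
        if len(self.fast) > self.window_size:
            # The oldest turn leaves the window and is compressed into memory.
            evicted = self.fast.popleft()
            self.slow.extend(self.prune(evicted))
            # Keep only the most recent memory tokens so the context stays bounded.
            self.slow = self.slow[-self.max_memory_tokens:]

    def context(self) -> List[int]:
        # Model input: compressed history first, then the active window.
        recent = [tok for turn in self.fast for tok in turn]
        return self.slow + recent


ctx = SlowFastContext(window_size=2, keep_every=2, max_memory_tokens=3)
for i in range(5):
    ctx.add_turn([i * 10 + j for j in range(4)])
print(len(ctx.context()))
```

With these settings the context never exceeds `max_memory_tokens` plus `window_size` times the per-turn token count, regardless of how many turns are streamed in.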
