Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Abstract

Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and to recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes the dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with a prompt-based LLM, and state-of-the-art VLN with a strided-attention MLLM), Mem4Nav yields gains of 7-13 percentage points (pp) in Task Completion, a sufficient reduction in shortest-path distance (SPD), and an improvement of more than 10 pp in normalized Dynamic Time Warping (nDTW). Ablations confirm that both the hierarchical map and the dual memory modules are indispensable. Our code is open-sourced at https://github.com/tsinghua-fib-lab/Mem4Nav.
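
The claim that LTM tokens can be decoded losslessly rests on the reversibility of the embedding network. The abstract does not specify the block design, so the following is only a minimal sketch that assumes a RevNet-style additive coupling, one common way to build an invertible layer. The names make_map, rev_forward, and rev_inverse are hypothetical, and the small tanh maps merely stand in for whatever sub-layers the actual model uses. The inputs are recovered up to floating-point rounding.

```python
import math
import random


def make_map(dim, seed):
    """Build a fixed random map R^dim -> R^dim (hypothetical stand-in for a sub-layer)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def apply(x):
        # One dense layer followed by tanh; it does not need to be invertible itself.
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return apply


def rev_forward(x1, x2, f, g):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Undo the coupling in reverse order: x2 = y2 - g(y1), x1 = y1 - f(x2)."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


# Usage: encode two halves of an embedding, then reconstruct them.
f, g = make_map(4, seed=0), make_map(4, seed=1)
x1, x2 = [0.5, -1.0, 2.0, 0.0], [1.5, 0.25, -0.75, 3.0]
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)
max_error = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
```

Because the inverse only subtracts what the forward pass added, no information about the inputs has to be stored separately, which is what allows a compressed token to be expanded back into the embedding it was written from.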
