RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

Abstract

We present RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams and requires no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings, spanning vision and language and derived from agglomerative foundation models (e.g., RADIO), with geometric scene information. This coupling takes place in initialization, optimization, and factor graph connections, and it improves the consistency of the map across modalities. The optimization is wrapped within adaptive robust kernels designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an egocentric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe
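
The open-vocabulary grounding step described above can be illustrated with a minimal sketch. It assumes that each 3D map point carries a language-aligned embedding and that a text query is embedded into the same space, so grounding reduces to a cosine-similarity threshold. The function names, the threshold value, and the toy three-dimensional embeddings are hypothetical and are not taken from the RADIO-ViPE implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def ground_query(query_emb, point_embs, threshold=0.5):
    """Return indices of map points whose embedding matches the query.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        i for i, emb in enumerate(point_embs)
        if cosine(query_emb, emb) >= threshold
    ]


# Toy embeddings standing in for per-point features (hypothetical values).
point_embs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
# Toy embedding standing in for an encoded text query (hypothetical value).
query_emb = [1.0, 0.0, 0.0]

matches = ground_query(query_emb, point_embs)
print(matches)
```

In this toy setup the first two points align with the query direction and the third is orthogonal to it, so only the first two are selected. A real system would obtain both embeddings from the foundation model rather than from hand-written vectors.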
