Semantic Audio-Visual Navigation in Continuous Environments

Abstract

Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.