Semantic Audio-Visual Navigation in Continuous Environments

Abstract

Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, which restricts agents to discrete grid positions and leads to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.
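To make the limitation of precomputed RIRs concrete, the following is a minimal sketch of grid-based binaural rendering: a mono source signal is convolved with a left and a right impulse response, and the agent position is snapped to the nearest grid node because responses exist only there. This is not the SAVN-CE or MAGNet implementation; the function names, the grid spacing, and the toy signal values are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: hypothetical helper names and toy values.

def convolve(signal, kernel):
    """Full discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, rir_left, rir_right):
    """Return (left, right) ear signals for one source-listener pose."""
    return convolve(mono, rir_left), convolve(mono, rir_right)


def snap_to_grid(position, spacing):
    """Snap a continuous position to the nearest grid node.

    With precomputed RIRs, audio can only be rendered at such nodes,
    which is the source of the spatially discontinuous observations.
    """
    return tuple(round(c / spacing) * spacing for c in position)


if __name__ == "__main__":
    mono = [1.0, 0.5]          # toy source signal
    rir_left = [1.0, 0.0]      # toy left-ear impulse response
    rir_right = [0.0, 0.5]     # toy right-ear response (delayed, attenuated)
    left, right = render_binaural(mono, rir_left, rir_right)
    print(left, right)
    print(snap_to_grid((1.3, 0.2), 1.0))
```

In a continuous setting, the snapping step is no longer needed, because the audio is rendered at the agent's actual pose rather than looked up at a grid node.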