Semantic Audio-Visual Navigation in Continuous Environments
March 20, 2026
Authors: Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang
cs.AI
Abstract
Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.
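The fusion idea described in the abstract can be sketched in miniature. The following is a hypothetical illustration, not the authors' implementation: audio, visual, and self-motion (ego-pose) features are projected into a shared token space, concatenated with a buffer of tokens from past steps, and fused by self-attention, so that even when the sound source falls silent, past audio tokens can still shape the current goal estimate. All names, dimensions, and the single-head attention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # shared embedding dimension (hypothetical)

def attend(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Hypothetical per-step observation features (assumed already encoded
# by modality-specific encoders, which are omitted here).
audio_tok  = rng.normal(size=(1, d))   # binaural audio embedding
visual_tok = rng.normal(size=(1, d))   # visual (e.g. RGB-D) embedding
pose_tok   = rng.normal(size=(1, d))   # self-motion / ego-pose embedding

# Memory buffer of tokens from earlier steps: memory-augmented reasoning
# means the agent can attend to past observations of the (now silent) source.
memory = rng.normal(size=(4, d))

tokens = np.concatenate([memory, audio_tok, visual_tok, pose_tok], axis=0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
fused = attend(tokens, Wq, Wk, Wv)

# Take the last token's output as the current goal representation.
goal_repr = fused[-1]
print(goal_repr.shape)
```

In a full model this block would be stacked, multi-headed, and trained end-to-end with a navigation policy; the sketch only shows how heterogeneous modalities and history can share one attention context.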