JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
February 20, 2026
Authors: Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang
cs.AI
Abstract
Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, enabling joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order Ambisonics audio. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints and datasets will be released upon acceptance.
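The Neural IV described above builds on the classical acoustic intensity vector, which the paper does not spell out here. As background, a minimal sketch of the conventional (non-learned) baseline follows: estimating a single direction of arrival from first-order Ambisonics channels via the active intensity vector in the time-frequency domain. The function name, STFT parameters, and channel convention (W, X, Y, Z with matched gains) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def foa_intensity_doa(w, x, y, z, n_fft=512, hop=256):
    """Estimate one dominant direction of arrival (azimuth, elevation in
    degrees) from first-order Ambisonics channels using the conventional
    acoustic intensity vector. Illustrative baseline only; the paper's
    Neural IV replaces this hand-crafted cue with a learned one."""
    win = np.hanning(n_fft)

    def stft(sig):
        # Frame the signal, window it, and take the one-sided FFT.
        frames = [np.fft.rfft(win * sig[i:i + n_fft])
                  for i in range(0, len(sig) - n_fft + 1, hop)]
        return np.array(frames)  # shape (T, F), complex

    W, X, Y, Z = (stft(c) for c in (w, x, y, z))
    # Active intensity: real part of conj(pressure) * particle velocity,
    # with W as the pressure channel and X/Y/Z as velocity components.
    Ix = np.real(np.conj(W) * X)
    Iy = np.real(np.conj(W) * Y)
    Iz = np.real(np.conj(W) * Z)
    # Average over all time-frequency bins (assumes one dominant source).
    ix, iy, iz = Ix.mean(), Iy.mean(), Iz.mean()
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation
```

Under this convention, a broadband plane wave from azimuth θ in the horizontal plane gives X = cos(θ)·W and Y = sin(θ)·W, so the averaged intensity vector points back at θ; overlapping sources break the single-source averaging, which is the failure mode Neural IV is designed to handle.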