JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Abstract

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation with JAEGER, a framework that extends AV-LLMs to 3D space and enables joint spatial grounding and reasoning by integrating RGB-D observations with multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets will be released upon acceptance.
