The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization

January 6, 2026
Authors: Ruixing Zhang, Zihan Liu, Leilei Sun, Tongyu Zhu, Weifeng Lv
cs.AI

Abstract

Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations of 16 ALMs show that audio geo-localization capability has begun to emerge in these models. We find that closed-source models substantially outperform open-source models, and that linguistic clues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance ALMs toward better geospatial reasoning capability.