The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization

January 6, 2026
Authors: Ruixing Zhang, Zihan Liu, Leilei Sun, Tongyu Zhu, Weifeng Lv
cs.AI

Abstract

Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations on 16 ALMs show that audio geo-localization capability has begun to emerge in existing ALMs. We find that closed-source models substantially outperform open-source models, and that linguistic clues often serve as the main scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance the geospatial reasoning capability of ALMs.
PDF · 11 · January 8, 2026