

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

October 28, 2025
作者: Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.