STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Abstract

Despite rapid progress in Multimodal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting. The latter includes segment reordering for continuous and discrete processes, as well as spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where answering from captions alone reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidence that it focuses on cues that are hard to describe in language. Evaluating 19 models reveals substantial gaps relative to humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.