How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation
March 19, 2026
作者: Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee
cs.AI
Abstract
Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they acquire through text-only pre-training, and how this knowledge affects downstream performance, remains unclear. We study this gap by comparing different LLMs across two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across model families and that text-only results correlate strongly with audio performance. Our work provides empirical grounding for a comprehensive understanding of the role of LLMs in audio research.