AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
May 22, 2025
Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhuo Chen, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
cs.AI
Abstract
The rapid advancement and expanding applications of Audio Large Language
Models (ALLMs) demand a rigorous understanding of their trustworthiness.
However, systematic research on evaluating these models, particularly
concerning risks unique to the audio modality, remains largely unexplored.
Existing evaluation frameworks primarily focus on the text modality or address
only a restricted set of safety dimensions, failing to adequately account for
the unique characteristics and application scenarios inherent to the audio
modality. We introduce AudioTrust, the first multifaceted trustworthiness
evaluation framework and benchmark specifically designed for ALLMs. AudioTrust
facilitates assessments across six key dimensions: fairness, hallucination,
safety, privacy, robustness, and authentication. To comprehensively evaluate
these dimensions, AudioTrust is structured around 18 distinct experimental
setups. Its core is a meticulously constructed dataset of over 4,420 audio/text
samples, drawn from real-world scenarios (e.g., daily conversations, emergency
calls, voice assistant interactions), specifically designed to probe the
multifaceted trustworthiness of ALLMs. For assessment, we design 9
audio-specific evaluation metrics and employ a large-scale automated pipeline
for objective and scalable scoring of model outputs.
Experimental results reveal the trustworthiness boundaries and limitations of
current state-of-the-art open-source and closed-source ALLMs when confronted
with various high-risk audio scenarios, offering valuable insights for the
secure and trustworthy deployment of future audio models. Our platform and
benchmark are available at https://github.com/JusperLee/AudioTrust.
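To make the evaluation flow concrete, the sketch below illustrates how a benchmark of this shape might be driven end to end: iterating the six trustworthiness dimensions, feeding each audio/text sample to a model, and aggregating per-dimension scores from an automated judge. All names here (load_samples, judge_score, the manifest layout, and the model.generate interface) are hypothetical placeholders for illustration, not AudioTrust's actual API; consult the repository for the real entry points.

```python
# Minimal sketch of an automated scoring loop over a multi-dimension
# audio trustworthiness benchmark. Hypothetical API, for illustration only.
import json
from pathlib import Path

# The six dimensions named in the abstract.
DIMENSIONS = ["fairness", "hallucination", "safety",
              "privacy", "robustness", "authentication"]

def load_samples(root: Path, dimension: str):
    """Yield (audio_path, prompt, reference) triples for one dimension.

    Assumes a hypothetical per-dimension JSONL manifest; the real dataset
    layout may differ.
    """
    manifest = root / dimension / "manifest.jsonl"
    with manifest.open() as f:
        for line in f:
            item = json.loads(line)
            yield item["audio"], item["prompt"], item.get("reference")

def judge_score(response: str, reference: str | None) -> float:
    """Placeholder for an LLM-as-judge call returning a score in [0, 1]."""
    raise NotImplementedError("plug in an automated judge here")

def evaluate(model, root: Path) -> dict[str, float]:
    """Average the judge's per-sample scores within each dimension."""
    scores: dict[str, float] = {}
    for dim in DIMENSIONS:
        per_sample = []
        for audio, prompt, ref in load_samples(root, dim):
            response = model.generate(audio=audio, prompt=prompt)
            per_sample.append(judge_score(response, ref))
        scores[dim] = sum(per_sample) / max(len(per_sample), 1)
    return scores
```

The per-dimension averaging mirrors the paper's framing: rather than a single safety score, each of the six dimensions is scored separately so that trustworthiness gaps (e.g., strong robustness but weak privacy) remain visible in the final report.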