

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

May 22, 2025
作者: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhuo Chen, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
cs.AI

Abstract

The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark defines 9 audio-specific evaluation metrics, and model outputs are scored by a large-scale automated pipeline for objective and scalable evaluation. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
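The abstract describes an automated pipeline that scores model outputs across the six trustworthiness dimensions. The paper does not specify the pipeline's implementation; the following is a minimal, hypothetical sketch of the aggregation step, assuming each sample has already received a judge score in [0, 1] (the `JudgedSample` schema and `aggregate_scores` helper are illustrative, not part of AudioTrust's actual API):

```python
from dataclasses import dataclass
from collections import defaultdict

# The six trustworthiness dimensions evaluated by AudioTrust.
DIMENSIONS = ["fairness", "hallucination", "safety",
              "privacy", "robustness", "authentication"]

@dataclass
class JudgedSample:
    """One benchmark sample after automated judging (hypothetical schema)."""
    dimension: str   # one of DIMENSIONS
    score: float     # judge score in [0, 1]

def aggregate_scores(samples):
    """Average judge scores per dimension, as an automated pipeline might."""
    buckets = defaultdict(list)
    for s in samples:
        if s.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {s.dimension}")
        buckets[s.dimension].append(s.score)
    # Report a mean score only for dimensions that have samples.
    return {d: sum(v) / len(v) for d, v in buckets.items() if v}

if __name__ == "__main__":
    demo = [JudgedSample("safety", 0.8), JudgedSample("safety", 0.6),
            JudgedSample("privacy", 1.0)]
    print(aggregate_scores(demo))
```

In practice such a pipeline would also need the judging stage itself (e.g., an LLM judge applying the 9 audio-specific metrics) before aggregation; this sketch covers only the scalable-scoring idea mentioned in the abstract.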

