AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
May 22, 2025
Authors: Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhuo Chen, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
cs.AI
Abstract
The rapid advancement and expanding applications of Audio Large Language
Models (ALLMs) demand a rigorous understanding of their trustworthiness.
However, systematic research on evaluating these models, particularly
concerning risks unique to the audio modality, remains largely unexplored.
Existing evaluation frameworks primarily focus on the text modality or address
only a restricted set of safety dimensions, failing to adequately account for
the unique characteristics and application scenarios inherent to the audio
modality. We introduce AudioTrust-the first multifaceted trustworthiness
evaluation framework and benchmark specifically designed for ALLMs. AudioTrust
facilitates assessments across six key dimensions: fairness, hallucination,
safety, privacy, robustness, and authentication. To comprehensively evaluate
these dimensions, AudioTrust is structured around 18 distinct experimental
setups. Its core is a meticulously constructed dataset of over 4,420 audio/text
samples, drawn from real-world scenarios (e.g., daily conversations, emergency
calls, voice assistant interactions), specifically designed to probe the
multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully
designs 9 audio-specific evaluation metrics, and we employ a large-scale
automated pipeline for objective and scalable scoring of model outputs.
Experimental results reveal the trustworthiness boundaries and limitations of
current state-of-the-art open-source and closed-source ALLMs when confronted
with various high-risk audio scenarios, offering valuable insights for the
secure and trustworthy deployment of future audio models. Our platform and
benchmark are available at https://github.com/JusperLee/AudioTrust.