TrustLLM: Trustworthiness in Large Language Models
January 10, 2024
Authors: Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bhavya Kailkhura, Caiming Xiong, Chao Zhang, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yue Zhao
cs.AI
Abstract
Large language models (LLMs), exemplified by ChatGPT, have gained
considerable attention for their excellent natural language processing
capabilities. Nonetheless, these LLMs present many challenges, particularly in
the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs
emerges as an important topic. This paper introduces TrustLLM, a comprehensive
study of trustworthiness in LLMs that includes principles for different
dimensions of trustworthiness, an established benchmark, an evaluation and
analysis of trustworthiness for mainstream LLMs, and a discussion of open
challenges and future directions. Specifically, we first propose a set of principles for
trustworthy LLMs that span eight different dimensions. Based on these
principles, we further establish a benchmark across six dimensions including
truthfulness, safety, fairness, robustness, privacy, and machine ethics. We
then present a study in TrustLLM evaluating 16 mainstream LLMs on over 30
datasets. Our findings first show that, in general, trustworthiness and
utility (i.e., functional effectiveness) are positively related. Second, our
observations reveal that proprietary LLMs generally outperform most open-source
counterparts in terms of trustworthiness, raising concerns about the potential
risks of widely accessible open-source LLMs. However, a few open-source LLMs
come very close to proprietary ones. Third, it is important to note that some
LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent
that they compromise their utility by mistakenly treating benign prompts as
harmful and consequently not responding. Finally, we emphasize the importance
of ensuring transparency not only in the models themselves but also in the
technologies that underpin trustworthiness. Knowing the specific trustworthy
technologies that have been employed is crucial for analyzing their
effectiveness.
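
The multi-dimensional evaluation the abstract describes can be pictured as a loop that scores a model on the datasets within each dimension, plus a separate check for over-refusal on benign prompts. The Python below is a minimal, hypothetical sketch, not the TrustLLM toolkit's API: only the six dimension names come from the abstract, while `trustworthiness_report`, `over_refusal_rate`, `is_refusal`, and the toy model are illustrative assumptions.

```python
# Minimal sketch of a multi-dimensional trustworthiness evaluation.
# All function and variable names are hypothetical; only the six
# dimension names come from the TrustLLM abstract.
from statistics import mean

DIMENSIONS = ["truthfulness", "safety", "fairness",
              "robustness", "privacy", "machine_ethics"]

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for detecting a refusal to answer."""
    markers = ("i cannot", "i can't", "i'm sorry", "as an ai")
    return any(m in response.lower() for m in markers)

def over_refusal_rate(model, benign_prompts) -> float:
    """Fraction of benign prompts the model declines to answer.

    A high rate suggests over-calibration toward safety: the model
    trades utility for the appearance of trustworthiness.
    """
    return mean(is_refusal(model(p)) for p in benign_prompts)

def trustworthiness_report(model, benchmarks) -> dict:
    """Average per-dataset scores within each dimension.

    `benchmarks` maps dimension -> list of (dataset_name, scorer),
    where scorer(model) returns a score in [0, 1].
    """
    return {dim: mean(scorer(model) for _, scorer in tasks)
            for dim, tasks in benchmarks.items()}

if __name__ == "__main__":
    # Toy stand-in for an LLM: refuses whenever the prompt length is odd.
    def toy_model(prompt: str) -> str:
        return "I'm sorry, I can't help with that." if len(prompt) % 2 else "Sure: ..."

    benign = ["What is the capital of France?",
              "Summarize photosynthesis in one sentence.",
              "Translate 'hello' into Spanish.",
              "Explain recursion briefly."]
    print(f"over-refusal rate: {over_refusal_rate(toy_model, benign):.2f}")

    # Placeholder scorers; real ones would run dimension-specific datasets.
    benchmarks = {dim: [("toy_dataset", lambda m: 0.8)] for dim in DIMENSIONS}
    print(trustworthiness_report(toy_model, benchmarks))
```

In a real evaluation, each scorer would itself prompt the model over a dataset and apply a dimension-specific metric, and the keyword heuristic would be replaced by a classifier, since simple matching misses paraphrased refusals.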