TrustLLM: Trustworthiness in Large Language Models
January 10, 2024
Authors: Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bhavya Kailkhura, Caiming Xiong, Chao Zhang, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yue Zhao
cs.AI
Abstract
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Ensuring the trustworthiness of LLMs therefore emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study in TrustLLM evaluating 16 mainstream LLMs on over 30 datasets. Our findings first show that, in general, trustworthiness and utility (i.e., functional effectiveness) are positively related. Second, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs; however, a few open-source LLMs come very close to their proprietary peers. Third, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently refusing to respond. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness: knowing which specific trustworthiness-enhancing techniques have been employed is crucial for analyzing their effectiveness.
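
To make the first finding concrete, the minimal Python sketch below computes a Pearson correlation between per-model trustworthiness and utility scores. All scores, and the number of models, are hypothetical placeholders for illustration, not data from the paper; TrustLLM's actual analysis aggregates results over many datasets per dimension.

    # Minimal sketch: correlating aggregate trustworthiness and utility scores
    # across models. All numbers below are hypothetical placeholders.
    from statistics import mean

    def pearson(xs, ys):
        """Pearson correlation coefficient between two equal-length score lists."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    # Hypothetical aggregate scores in [0, 1] for five unnamed models.
    trustworthiness = [0.82, 0.75, 0.68, 0.60, 0.55]
    utility = [0.88, 0.80, 0.72, 0.65, 0.58]

    # A value near +1 would indicate the positive relation reported in the paper.
    print(f"correlation: {pearson(trustworthiness, utility):.3f}")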