DecodingTrust:对GPT模型信任度的全面评估
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
June 20, 2023
作者: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li
cs.AI
摘要
生成式预训练变换器(GPT)模型在能力方面取得了令人振奋的进展,引起了从业者和公众的兴趣。然而,尽管有关GPT模型可信度的文献仍然有限,从业者已经提出将功能强大的GPT模型应用于医疗保健和金融等敏感应用领域 - 在这些领域错误可能代价高昂。为此,本文提出了对大型语言模型进行全面可信度评估的方法,重点关注GPT-4和GPT-3.5,考虑了多种视角 - 包括毒性、刻板印象偏见、对抗鲁棒性、分布外鲁棒性、对抗示范的鲁棒性、隐私、机器伦理和公平性。根据我们的评估,我们发现了以前未公开的可信度威胁漏洞。例如,我们发现GPT模型很容易被误导生成有毒和带偏见的输出,并泄露训练数据和对话历史中的私人信息。我们还发现,尽管GPT-4在标准基准测试中通常比GPT-3.5更可信,但在越狱系统或用户提示方面更容易受到攻击,可能是因为GPT-4更精确地遵循(误导性的)指令。我们的工作展示了对GPT模型进行全面可信度评估,并揭示了可信度方面的差距。我们的基准测试公开可在https://decodingtrust.github.io/获取。
English
Generative Pre-trained Transformer (GPT) models have exhibited exciting
progress in capabilities, capturing the interest of practitioners and the
public alike. Yet, while the literature on the trustworthiness of GPT models
remains limited, practitioners have proposed employing capable GPT models for
sensitive applications to healthcare and finance - where mistakes can be
costly. To this end, this work proposes a comprehensive trustworthiness
evaluation for large language models with a focus on GPT-4 and GPT-3.5,
considering diverse perspectives - including toxicity, stereotype bias,
adversarial robustness, out-of-distribution robustness, robustness on
adversarial demonstrations, privacy, machine ethics, and fairness. Based on our
evaluations, we discover previously unpublished vulnerabilities to
trustworthiness threats. For instance, we find that GPT models can be easily
misled to generate toxic and biased outputs and leak private information in
both training data and conversation history. We also find that although GPT-4
is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more
vulnerable given jailbreaking system or user prompts, potentially due to the
reason that GPT-4 follows the (misleading) instructions more precisely. Our
work illustrates a comprehensive trustworthiness evaluation of GPT models and
sheds light on the trustworthiness gaps. Our benchmark is publicly available at
https://decodingtrust.github.io/.