DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

June 20, 2023
作者: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li
cs.AI

Abstract

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance, where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models, with a focus on GPT-4 and GPT-3.5, considering diverse perspectives: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and to leak private information from both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable to jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on their trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/.