DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Abstract

Generative Pre-trained Transformer (GPT) models have made striking progress in capabilities, capturing the interest of practitioners and the public alike. Yet while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models in sensitive applications such as healthcare and finance, where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models, with a focus on GPT-4 and GPT-3.5, considering diverse perspectives: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled into generating toxic and biased outputs and into leaking private information from both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable to jailbreaking system or user prompts, potentially because it follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/.

