Exploiting Novel GPT-4 APIs

Abstract

Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than plain text generation: they expose "gray-box" access, which leads to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling, and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any addition to the functionality exposed by an API can create new vulnerabilities.
