Exploiting Novel GPT-4 APIs
December 21, 2023
Authors: Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
cs.AI
Abstract
Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose "gray-box" access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any additions to the functionality exposed by an API can create new vulnerabilities.
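For orientation, the sketch below shows (with benign placeholders) how the three gray-box surfaces named in the abstract are reached through OpenAI's client library. It is a minimal sketch assuming the openai Python SDK (v1.x) and the late-2023 Assistants beta; the file names, function schema, and fine-tunable model id are hypothetical illustrations, not the paper's actual red-teaming setup.

```python
# Sketch of the three gray-box API surfaces discussed above, using the openai
# Python SDK (v1.x). Model ids, file names, and the example function schema are
# illustrative placeholders, not the paper's red-teaming data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Fine-tuning: upload a (benign) chat-formatted JSONL dataset and start a job.
#    The paper shows that even small datasets submitted here can weaken safeguards.
train_file = client.files.create(
    file=open("train_examples.jsonl", "rb"),  # ~100 chat-formatted examples
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    model="gpt-4-0613",          # fine-tunable GPT-4 snapshot (access-gated)
    training_file=train_file.id,
)

# 2. Function calling: an Assistant is handed a function schema it may invoke.
#    The schema itself becomes attack surface (it can be divulged or misused).
assistant = client.beta.assistants.create(
    name="order-bot",
    model="gpt-4-1106-preview",
    instructions="Help users check order status.",
    tools=[{
        "type": "function",
        "function": {
            "name": "get_order_status",  # hypothetical internal function
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }],
)

# 3. Knowledge retrieval: uploaded documents are searched and quoted by the
#    model, so instructions planted inside a document can hijack its behavior.
doc = client.files.create(file=open("knowledge_base.pdf", "rb"), purpose="assistants")
retrieval_assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],  # retrieval tool of the v1 Assistants beta
    file_ids=[doc.id],
)
```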