

Stealing Part of a Production Language Model

March 11, 2024
Authors: Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr
cs.AI

Abstract

We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
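The abstract does not spell out the mechanism, but the central observation can be sketched: the final logits are a linear projection of an h-dimensional hidden state, so any collection of queried logit vectors spans at most an h-dimensional subspace, and the numerical rank of a stacked matrix of logit vectors reveals the hidden dimension. Below is a minimal illustrative sketch of that rank-based idea, not the paper's exact algorithm; `query_logits` is a hypothetical helper returning the full logit vector for a prompt from the target API.

```python
import numpy as np

def estimate_hidden_dim(query_logits, prompts):
    """Estimate a model's hidden dimension from queried logit vectors.

    Logits equal W @ h for a (vocab_size x hidden_dim) projection W,
    so a matrix of logit vectors from more prompts than hidden
    dimensions has numerical rank equal to the hidden dimension.
    `query_logits` is a hypothetical API helper (an assumption here).
    """
    # Stack one logit vector per prompt: shape (n_prompts, vocab_size).
    Q = np.stack([query_logits(p) for p in prompts])

    # Singular values drop sharply after index h = hidden_dim.
    s = np.linalg.svd(Q, compute_uv=False)

    # The largest multiplicative gap between consecutive singular
    # values marks the effective rank, i.e. the hidden dimension.
    gaps = s[:-1] / (s[1:] + 1e-12)
    return int(np.argmax(gaps)) + 1
```

In practice this needs more prompts than the hidden dimension (e.g. several thousand queries for h = 2048), and production APIs expose only top-k log-probabilities rather than full logit vectors; handling that restriction, and recovering the projection matrix itself up to symmetries, is what the paper's full attack addresses.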
