Stealing Part of a Production Language Model

Abstract

We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have hidden dimensions of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
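The abstract does not describe the mechanics of the attack, so the following is only a minimal sketch of the linear-algebra idea that makes such a recovery plausible, not the authors' implementation. It assumes full logit vectors are observable, whereas production APIs typically expose far less. The final layer maps a hidden state of dimension h to vocabulary-sized logits, so a matrix of logit vectors collected over many prompts has rank at most h. The model here is simulated with random matrices, and all names (`estimate_hidden_dim`, `W`, `H`, `Q`) and sizes are hypothetical.

```python
import numpy as np


def estimate_hidden_dim(logits: np.ndarray) -> int:
    """Estimate the rank of a logit matrix from its largest singular-value gap."""
    s = np.linalg.svd(logits, compute_uv=False)
    # Floor the numerically-zero tail so ratios between noise values stay bounded.
    s = np.maximum(s, s[0] * np.finfo(np.float64).eps)
    gaps = s[:-1] / s[1:]  # ratio between consecutive singular values
    return int(np.argmax(gaps)) + 1


# Simulated stand-in for a black-box model (assumption: no real API is queried).
rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 512, 64, 200

W = rng.normal(size=(vocab_size, hidden_dim))  # secret projection matrix
H = rng.normal(size=(hidden_dim, n_queries))   # final hidden states, one per query
Q = W @ H                                      # observed logits, rank <= hidden_dim

estimated = estimate_hidden_dim(Q)

# Recover the projection matrix up to an unknown invertible h x h transform.
U, s, _ = np.linalg.svd(Q, full_matrices=False)
W_hat = U[:, :estimated] * s[:estimated]

# Solve W_hat @ A = W to confirm that W_hat spans the same column space as W.
A, *_ = np.linalg.lstsq(W_hat, W, rcond=None)
same_subspace = np.allclose(W_hat @ A, W, atol=1e-6)

print(estimated, same_subspace)
```

In this toy setting the number of queries must exceed the hidden dimension, otherwise the logit matrix cannot reveal the full rank. The final check illustrates what "up to symmetries" can mean in practice: the recovered matrix matches the true one only after multiplication by some unknown invertible matrix.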
