Better Language Model Inversion by Compactly Representing Next-Token Distributions

Abstract

Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields large gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2 to 3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5 to 27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
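The low-dimensional-subspace claim can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes a model whose logits are a linear function of a d-dimensional hidden state, followed by a softmax. The unembedding matrix `W`, the toy sizes, and the random hidden states are all hypothetical. Under that assumption every log-probability vector has the form W h - c * 1, so it lies in a subspace of dimension at most d + 1 and can be compressed and restored with a fixed linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, num_steps = 1000, 16, 32  # toy sizes (assumptions)

# Hypothetical unembedding matrix: maps a hidden state to vocabulary logits.
W = rng.normal(size=(vocab_size, hidden_dim))

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# One random hidden state per generation step, giving one full
# next-token log-probability vector per step.
hidden = rng.normal(size=(num_steps, hidden_dim))
logprobs = log_softmax(hidden @ W.T)  # shape: (num_steps, vocab_size)

# Each row equals W h minus a scalar normalizer, so it lies in the
# column space of [W | 1], which has at most hidden_dim + 1 dimensions.
basis = np.hstack([W, np.ones((vocab_size, 1))])
compress = np.linalg.pinv(basis)  # linear map: vocab_size -> hidden_dim + 1

compressed = logprobs @ compress.T     # shape: (num_steps, hidden_dim + 1)
reconstructed = compressed @ basis.T   # back to shape (num_steps, vocab_size)

max_error = np.abs(reconstructed - logprobs).max()
rank = np.linalg.matrix_rank(logprobs)
print(f"compressed {vocab_size} -> {compressed.shape[1]} numbers per step")
print(f"max reconstruction error: {max_error:.2e}, rank of logprobs: {rank}")
```

In this toy setting each 1000-entry distribution is stored as 17 numbers, and the reconstruction error is at the level of floating-point round-off. That is the sense in which such a compression is lossless.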
