Better Language Model Inversion by Compactly Representing Next-Token Distributions
June 20, 2025
Authors: Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
cs.AI
Abstract
Language model inversion seeks to recover hidden prompts using only language
model outputs. This capability has implications for security and accountability
in language model deployments, such as leaking private information from an
API-protected language model's system message. We propose a new method --
prompt inversion from logprob sequences (PILS) -- that recovers hidden prompts
by gleaning clues from the model's next-token probabilities over the course of
multiple generation steps. Our method is enabled by a key insight: The
vector-valued outputs of a language model occupy a low-dimensional subspace.
This enables us to losslessly compress the full next-token probability
distribution over multiple generation steps using a linear map, allowing more
output information to be used for inversion. Our approach yields massive gains
over previous state-of-the-art methods for recovering hidden prompts, achieving
2--3.5 times higher exact recovery rates across test sets, in one case
increasing the recovery rate from 17% to 60%. Our method also exhibits
surprisingly good generalization behavior; for instance, an inverter trained on
16 generation steps gets 5--27 points higher prompt recovery when we increase
the number of steps to 32 at test time. Furthermore, we demonstrate strong
performance of our method on the more challenging task of recovering hidden
system messages. We also analyze the role of verbatim repetition in prompt
recovery and propose a new method for cross-family model transfer for
logit-based inverters. Our findings show that next-token probabilities are a
considerably more vulnerable attack surface for inversion attacks than
previously known.
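
The low-dimensional structure behind PILS's compression step can be illustrated concretely: a language model's logits are W h for an unembedding matrix W whose rank is at most the hidden dimension d, so every full-vocabulary log-probability vector lies in the span of W's columns plus the all-ones direction (the log-softmax shift). The numpy sketch below shows lossless linear compression of a V-dimensional logprob vector down to d+1 numbers under that assumption; the sizes, the random W, and the premise that a basis for the output subspace is available are illustrative, not the paper's actual setup.

```python
import numpy as np

# Illustrative sizes for a tiny toy "model" (not the paper's setup).
V, d = 1000, 16                      # vocabulary size, hidden dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))          # toy unembedding matrix, assumed known

def logprobs(h):
    """Full next-token log-probabilities for hidden state h."""
    z = W @ h
    return z - np.logaddexp.reduce(z)  # log-softmax: z - logsumexp(z)

# Every logprob vector equals W h - logsumexp(W h) * 1, so it lies in
# span(columns of W, all-ones). Build a basis for that (d+1)-dim subspace.
B = np.hstack([W, np.ones((V, 1))])  # V x (d+1) basis
P = np.linalg.pinv(B)                # (d+1) x V linear compression map

h = rng.normal(size=d)
lp = logprobs(h)                     # V-dimensional model output
code = P @ lp                        # compressed to d+1 numbers
recon = B @ code                     # exact reconstruction (lp is in span(B))
assert np.allclose(recon, lp)
```

Because the reconstruction is exact, an inverter can consume many generation steps' worth of full distributions at a compact, fixed cost per step (d+1 values rather than V), which is what lets PILS feed more output information into prompt recovery than logit-based methods that truncate or subsample the distribution.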