Better Language Model Inversion by Compactly Representing Next-Token Distributions
June 20, 2025
Authors: Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
cs.AI
Abstract
Language model inversion seeks to recover hidden prompts using only language
model outputs. This capability has implications for security and accountability
in language model deployments, such as leaking private information from an
API-protected language model's system message. We propose a new method --
prompt inversion from logprob sequences (PILS) -- that recovers hidden prompts
by gleaning clues from the model's next-token probabilities over the course of
multiple generation steps. Our method is enabled by a key insight: The
vector-valued outputs of a language model occupy a low-dimensional subspace.
This enables us to losslessly compress the full next-token probability
distribution over multiple generation steps using a linear map, allowing more
output information to be used for inversion. Our approach yields massive gains
over previous state-of-the-art methods for recovering hidden prompts, achieving
2--3.5 times higher exact recovery rates across test sets, in one case
increasing the recovery rate from 17% to 60%. Our method also exhibits
surprisingly good generalization behavior; for instance, an inverter trained on
16 generation steps gets 5--27 points higher prompt recovery when we increase
the number of steps to 32 at test time. Furthermore, we demonstrate strong
performance of our method on the more challenging task of recovering hidden
system messages. We also analyze the role of verbatim repetition in prompt
recovery and propose a new method for cross-family model transfer for
logit-based inverters. Our findings show that next-token probabilities are a
considerably more vulnerable attack surface for inversion attacks than
previously known.
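
The low-dimensional structure behind PILS's compression step can be illustrated concretely: a language model's logits are W h for an unembedding matrix W whose rank is at most the hidden dimension d, so every full-vocabulary log-probability vector lies in the span of W's columns plus the all-ones direction (the log-softmax shift). The numpy sketch below shows lossless linear compression of a V-dimensional logprob vector down to d+1 numbers under that assumption; the sizes, the random W, and the premise that a basis for the output subspace is available are illustrative, not the paper's actual setup.

```python
import numpy as np

# Illustrative sizes for a tiny toy "model" (not the paper's setup).
V, d = 1000, 16                      # vocabulary size, hidden dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))          # toy unembedding matrix, assumed known

def logprobs(h):
    """Full next-token log-probabilities for hidden state h."""
    z = W @ h
    return z - np.logaddexp.reduce(z)  # log-softmax: z - logsumexp(z)

# Every logprob vector equals W h - logsumexp(W h) * 1, so it lies in
# span(columns of W, all-ones). Build a basis for that (d+1)-dim subspace.
B = np.hstack([W, np.ones((V, 1))])  # V x (d+1) basis
P = np.linalg.pinv(B)                # (d+1) x V linear compression map

h = rng.normal(size=d)
lp = logprobs(h)                     # V-dimensional model output
code = P @ lp                        # compressed to d+1 numbers
recon = B @ code                     # exact reconstruction (lp is in span(B))
assert np.allclose(recon, lp)
```

Because the reconstruction is exact, an inverter can consume many generation steps' worth of full distributions at a compact, fixed cost per step (d+1 values rather than V), which is what lets PILS feed more output information into prompt recovery than logit-based methods that truncate or subsample the distribution.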