Better Language Model Inversion by Compactly Representing Next-Token Distributions
June 20, 2025
Authors: Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
cs.AI
Abstract
Language model inversion seeks to recover hidden prompts using only language
model outputs. This capability has implications for security and accountability
in language model deployments, such as leaking private information from an
API-protected language model's system message. We propose a new method --
prompt inversion from logprob sequences (PILS) -- that recovers hidden prompts
by gleaning clues from the model's next-token probabilities over the course of
multiple generation steps. Our method is enabled by a key insight: The
vector-valued outputs of a language model occupy a low-dimensional subspace.
This enables us to losslessly compress the full next-token probability
distribution over multiple generation steps using a linear map, allowing more
output information to be used for inversion. Our approach yields massive gains
over previous state-of-the-art methods for recovering hidden prompts, achieving
2--3.5 times higher exact recovery rates across test sets, in one case
increasing the recovery rate from 17% to 60%. Our method also exhibits
surprisingly good generalization behavior; for instance, an inverter trained on
16 generation steps achieves 5--27 points higher prompt recovery when we increase
the number of steps to 32 at test time. Furthermore, we demonstrate strong
performance of our method on the more challenging task of recovering hidden
system messages. We also analyze the role of verbatim repetition in prompt
recovery and propose a new method for cross-family model transfer for
logit-based inverters. Our findings show that next-token probabilities are a
considerably more vulnerable attack surface for inversion attacks than
previously known.
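
To make the abstract's key insight concrete, below is a minimal NumPy sketch (not the paper's PILS implementation) of why full next-token distributions compress losslessly: logit vectors are confined to the column space of the V-by-d unembedding matrix, and the log-softmax shift only adds one more direction (the all-ones vector). All names and dimensions here (W, H, B, V, d, T) are illustrative toy values, not quantities from the paper.

```python
import numpy as np

# Toy dimensions: vocabulary size V far exceeds hidden size d.
V, d, T = 5000, 64, 16
rng = np.random.default_rng(0)

# W plays the role of a model's unembedding matrix; H stacks the hidden
# states from T generation steps (both are random stand-ins).
W = rng.normal(size=(V, d))
H = rng.normal(size=(d, T))

# Full next-token log-probabilities at each step: logits minus logsumexp.
logits = W @ H                                           # (V, T)
logprobs = logits - np.logaddexp.reduce(logits, axis=0)  # (V, T)

# Every logit vector lies in the column space of W, and the logsumexp
# shift only subtracts a multiple of the all-ones vector, so each
# V-dimensional logprob vector lives in a (d+1)-dimensional subspace.
B = np.hstack([W, np.ones((V, 1))])   # (V, d+1) subspace basis
C = np.linalg.pinv(B) @ logprobs      # compressed codes: (d+1, T)

# Reconstruction is numerically exact: the linear map loses nothing.
reconstructed = B @ C
print(np.abs(reconstructed - logprobs).max())  # ~1e-12
```

In this toy setting, T full distributions (T x V numbers) shrink to T x (d+1) coefficients with exact reconstruction, which is the kind of compact representation that would let an inverter condition on many generation steps at once.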