Better Language Model Inversion by Compactly Representing Next-Token Distributions

Abstract

Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields large gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2 to 3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5 to 27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
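The low-dimensional-subspace insight can be illustrated with a small numerical sketch. It is not the PILS implementation: the random unembedding matrix, the toy sizes, and the function name `compress_logprobs` are all assumptions made for illustration, and the sketch assumes the unembedding matrix is known, which an API-only attacker may not have. If the logits are a linear function of a hidden state of size d, then the log-probabilities equal the logits minus a constant, so they lie in a subspace of dimension at most d + 1, and a fixed linear map can compress them to d + 1 numbers without loss.

```python
import numpy as np


def compress_logprobs(vocab_size=1000, hidden_size=16, seed=0):
    """Toy demonstration: log-probabilities compress losslessly to hidden_size + 1 numbers."""
    rng = np.random.default_rng(seed)

    # Hypothetical stand-ins for a model's unembedding matrix and final hidden state.
    unembed = rng.normal(size=(vocab_size, hidden_size))
    hidden = rng.normal(size=hidden_size)

    # Logits are linear in the hidden state; log-softmax subtracts one scalar.
    logits = unembed @ hidden
    shift = logits.max()  # subtract the max for numerical stability
    logprobs = logits - (shift + np.log(np.exp(logits - shift).sum()))

    # logprobs = unembed @ hidden - c * ones, so they lie in the span of
    # the columns of `unembed` plus the all-ones vector.
    basis = np.hstack([unembed, np.ones((vocab_size, 1))])

    # A fixed linear map (the pseudoinverse of the basis) does the compression.
    compress = np.linalg.pinv(basis)  # shape: (hidden_size + 1, vocab_size)
    code = compress @ logprobs        # compact representation
    restored = basis @ code           # exact reconstruction up to float error
    return logprobs, code, restored


if __name__ == "__main__":
    logprobs, code, restored = compress_logprobs()
    print(logprobs.shape, code.shape, np.allclose(restored, logprobs))
```

In this toy setting a 1000-entry distribution is represented by 17 numbers per generation step, which is why many steps of full next-token distributions can be fed to an inverter at modest cost.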
