Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

Abstract

Large Language Models (LLMs) are known to memorize significant portions of their training data. Parts of this memorized content have been shown to be extractable by simply querying the model, which poses a privacy risk. We present a novel approach that uses prompt-tuning to control the extraction rates of memorized content in LLMs. We present two prompt training strategies, one to increase and one to decrease extraction rates, which correspond to an attack and a defense, respectively. We demonstrate the effectiveness of our techniques using models from the GPT-Neo family on a public benchmark. For the 1.3B parameter GPT-Neo model, our attack yields a 9.3 percentage point increase in extraction rate compared to our baseline. Our defense can be tuned to achieve different privacy-utility trade-offs via a user-specified hyperparameter. We achieve an extraction rate reduction of up to 97.7% relative to our baseline, with a perplexity increase of 16.9%.
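The abstract reports the attack as a percentage point change and the defense as a relative change, and the two are easy to confuse. The following is a minimal sketch of how such figures could be computed. It assumes an exact-match definition of extraction rate (a generated suffix counts as extracted only if it equals the training suffix token for token), which the abstract does not state. All function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def extraction_rate(generated: Sequence[Sequence[int]],
                    reference: Sequence[Sequence[int]]) -> float:
    """Fraction of examples whose generated suffix equals the reference suffix.

    Assumption: exact token-level match; the abstract does not define the metric.
    """
    if len(generated) != len(reference):
        raise ValueError("generated and reference must have the same length")
    if not reference:
        return 0.0
    matches = sum(list(g) == list(r) for g, r in zip(generated, reference))
    return matches / len(reference)


def percentage_point_change(baseline_rate: float, new_rate: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (new_rate - baseline_rate) * 100.0


def relative_change_percent(baseline: float, new: float) -> float:
    """Change relative to the baseline value, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical token-id suffixes, for illustration only.
reference_suffixes = [[1, 2], [3, 4], [5, 6], [7, 8]]
generated_suffixes = [[1, 2], [3, 4], [0, 0], [7, 9]]
rate = extraction_rate(generated_suffixes, reference_suffixes)

# Hypothetical rates: 0.50 -> 0.75 is +25 percentage points,
# while 0.50 -> 0.25 is a 50% relative reduction.
attack_gain_pp = percentage_point_change(0.50, 0.75)
defense_change = relative_change_percent(0.50, 0.25)
```

Under these assumptions, the same helper for relative change also applies to the perplexity figure, where a positive value indicates a loss of utility.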
