The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

Abstract

Membership inference attacks serve as a useful tool for promoting fair use of language models, for example by detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to a model's hidden states or probability distribution, which prevents investigation of more widely used, API-access-only models such as GPT-4. In this work, we introduce the N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize, and subsequently generate, text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, the N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground-truth suffix; high similarity indicates likely membership. We first demonstrate on a diverse set of existing benchmarks that the N-Gram Coverage Attack outperforms other black-box methods while achieving performance comparable to, or even better than, state-of-the-art white-box attacks, despite having access only to text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget: as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models across multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
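
The procedure described in the abstract (sample several continuations of a prefix, score each against the true suffix by n-gram overlap, then aggregate) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the overlap metric, the aggregation rule, the tokenization, or the prefix length, so the choices here (whitespace tokens, the fraction of suffix trigrams covered, max aggregation, a 50% prefix) are assumptions made for illustration. `sample_fn` is a hypothetical stand-in for a call to the target model's text-generation API.

```python
from typing import Callable, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage(generation: str, suffix: str, n: int = 3) -> float:
    """Fraction of the suffix's n-grams that also occur in the generation.

    Assumption: whitespace tokenization and set-based overlap; the paper
    may use different metrics.
    """
    suffix_grams = ngrams(suffix.split(), n)
    if not suffix_grams:
        return 0.0
    gen_grams = ngrams(generation.split(), n)
    return len(suffix_grams & gen_grams) / len(suffix_grams)


def membership_score(
    candidate: str,
    sample_fn: Callable[[str], str],  # hypothetical: prefix -> one sampled continuation
    num_samples: int = 8,
    prefix_fraction: float = 0.5,     # assumed split point, not from the paper
    n: int = 3,
) -> float:
    """Score a candidate text; higher values suggest likely membership."""
    words = candidate.split()
    split = max(1, int(len(words) * prefix_fraction))
    prefix = " ".join(words[:split])
    suffix = " ".join(words[split:])
    # Query the target model several times and score each output.
    scores = [coverage(sample_fn(prefix), suffix, n) for _ in range(num_samples)]
    # Assumed aggregation: take the best-matching generation.
    return max(scores) if scores else 0.0
```

In use, the score would be compared against a threshold calibrated on known members and non-members. Raising `num_samples` corresponds to the larger attack compute budget that the abstract reports tends to improve performance.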
