The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage
August 13, 2025
Authors: Skyler Hallinan, Jaehun Jung, Melanie Sclar, Ximing Lu, Abhilasha Ravichander, Sahana Ramnath, Yejin Choi, Sai Praneeth Karimireddy, Niloofar Mireshghallah, Xiang Ren
cs.AI
Abstract
Membership inference attacks serve as a useful tool for the fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models' hidden states or probability distributions, which prevents investigation into more widely used, API-access-only models like GPT-4. In this work, we introduce the N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize, and subsequently generate, text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, the N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground-truth suffix; high similarity indicates likely membership. We first demonstrate on a diverse set of existing benchmarks that the N-Gram Coverage Attack outperforms other black-box methods while also achieving performance comparable to, or even better than, state-of-the-art white-box attacks, despite having access only to text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget: as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
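To make the attack concrete, below is a minimal sketch of the prefix-conditioned, generate-and-compare loop the abstract describes. It assumes whitespace tokenization, a coverage-style overlap metric (the fraction of the true suffix's n-grams that a generation reproduces), and max aggregation over samples; the paper's exact metrics, split point, and aggregation may differ. `query_model`, `prefix_fraction`, and `num_generations` are illustrative names, not the authors' API.

```python
from typing import Callable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_coverage(generation: str, suffix: str, n: int = 4) -> float:
    """Fraction of the true suffix's n-grams reproduced in a model generation."""
    suffix_grams = ngrams(suffix.split(), n)
    if not suffix_grams:
        return 0.0
    generation_grams = ngrams(generation.split(), n)
    return len(suffix_grams & generation_grams) / len(suffix_grams)


def ngram_coverage_attack(
    candidate: str,
    query_model: Callable[[str], str],  # hypothetical stand-in for a black-box API call
    num_generations: int = 16,          # the attack compute budget; more samples tend to help
    prefix_fraction: float = 0.5,       # assumed prefix/suffix split point
    n: int = 4,
) -> float:
    """Return a membership score for `candidate`; higher suggests membership."""
    words = candidate.split()
    cut = max(1, int(len(words) * prefix_fraction))
    prefix, suffix = " ".join(words[:cut]), " ".join(words[cut:])
    # Sample several continuations of the prefix from the target model,
    # then score each against the held-out true suffix.
    scores = [
        ngram_coverage(query_model(prefix), suffix, n)
        for _ in range(num_generations)
    ]
    # Aggregate per-sample similarities; taking the max is one simple choice.
    return max(scores)
```

In use, the returned score would be compared against a threshold calibrated on texts known to be outside the training data, and `num_generations` is the knob behind the compute-scaling observation above: more sampled continuations give more chances to surface a memorized suffix.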