The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage
August 13, 2025
Authors: Skyler Hallinan, Jaehun Jung, Melanie Sclar, Ximing Lu, Abhilasha Ravichander, Sahana Ramnath, Yejin Choi, Sai Praneeth Karimireddy, Niloofar Mireshghallah, Xiang Ren
cs.AI
Abstract
Membership inference attacks serve as a useful tool for the fair use of language
models, such as detecting potential copyright infringement and auditing data
leakage. However, many current state-of-the-art attacks require access to a
model's hidden states or probability distributions, which prevents investigation
into more widely used, API-access-only models like GPT-4. In this work, we
introduce N-Gram Coverage Attack, a membership inference attack that relies
solely on text outputs from the target model, enabling attacks on completely
black-box models. We leverage the observation that models are more likely to
memorize and subsequently generate text patterns that were commonly observed in
their training data. Specifically, to make a prediction on a candidate member,
N-Gram Coverage Attack first obtains multiple model generations conditioned on
a prefix of the candidate. It then uses n-gram overlap metrics to compute and
aggregate the similarities of these outputs with the ground truth suffix; high
similarities indicate likely membership. We first demonstrate on a diverse set
of existing benchmarks that the N-Gram Coverage Attack outperforms other
black-box methods while also, impressively, achieving performance comparable to
or even better than state-of-the-art white-box attacks, despite having access
only to text outputs. Interestingly, we find that the success rate of our method
scales with the attack compute budget: as we increase the number of sequences generated
from the target model conditioned on the prefix, attack performance tends to
improve. Having verified the accuracy of our method, we use it to investigate
previously unstudied closed OpenAI models across multiple domains. We find that
more recent models, such as GPT-4o, exhibit increased robustness to membership
inference, suggesting an evolving trend toward improved privacy protections.
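To make the black-box recipe concrete, below is a minimal sketch of the scoring step in Python. It assumes a hypothetical `generate(prefix, n_samples)` callable standing in for the target model's API; the function names, the choice of n = 4, and max-aggregation over samples are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the N-Gram Coverage Attack scoring step.
# Assumptions (not from the paper's code): a black-box generate(prefix, n_samples)
# API, whitespace tokenization, n = 4, and max-aggregation over samples.
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_coverage(generation, suffix, n=4):
    """Fraction of the ground-truth suffix's n-grams that also appear in a generation."""
    gen_ngrams = ngrams(generation.split(), n)
    suf_ngrams = ngrams(suffix.split(), n)
    if not suf_ngrams:
        return 0.0
    covered = sum(min(count, gen_ngrams[g]) for g, count in suf_ngrams.items())
    return covered / sum(suf_ngrams.values())

def membership_score(generate, prefix, suffix, n_samples=16, n=4):
    """Sample multiple continuations of the prefix and aggregate their
    n-gram coverage of the true suffix; a high score suggests membership."""
    generations = generate(prefix, n_samples)  # hypothetical black-box API
    scores = [ngram_coverage(g, suffix, n) for g in generations]
    return max(scores)  # max-aggregation; mean is another plausible choice

# A candidate is predicted to be a member if membership_score exceeds a
# threshold calibrated on known non-member texts.
```

In this reading, increasing `n_samples` is what drives the compute-for-accuracy scaling noted in the abstract: more sampled continuations give more chances to surface a memorized suffix.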