Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models
May 24, 2025
作者: Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Katherine Lee, Milad Nasr, Sahra Ghalebikesabi, Niloofar Mireshghallah, Meenatchi Sundaram Mutu Selva Annamalai, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper
cs.AI
Abstract
State-of-the-art membership inference attacks (MIAs) typically require
training many reference models, making it difficult to scale these attacks to
large pre-trained language models (LLMs). As a result, prior research has
either relied on weaker attacks that avoid training reference models (e.g.,
fine-tuning attacks), or on stronger attacks applied to small-scale models and
datasets. However, weaker attacks have been shown to be brittle, achieving
close-to-arbitrary success, and insights from strong attacks in simplified
settings do not translate to today's LLMs. These challenges have prompted an
important question: are the limitations observed in prior work due to attack
design choices, or are MIAs fundamentally ineffective on LLMs? We address this
question by scaling LiRA, one of the strongest MIAs, to GPT-2 architectures
ranging from 10M to 1B parameters, training reference models on over 20B tokens
from the C4 dataset. Our results advance the understanding of MIAs on LLMs in
three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their
effectiveness, however, remains limited (e.g., AUC<0.7) in practical settings;
and, (3) the relationship between MIA success and related privacy metrics is
not as straightforward as prior work has suggested.
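For intuition, the LiRA attack referenced in the abstract scores each candidate example by comparing its loss under the target model against Gaussians fit to losses from reference models trained with and without that example. Below is a minimal, hypothetical sketch of that scoring step (function and variable names are illustrative; real implementations use logit-scaled confidences, per-example Gaussians, and many reference models):

```python
from statistics import NormalDist, mean, stdev
from math import log

def lira_score(target_loss, in_losses, out_losses):
    """Log-likelihood ratio for one candidate example.

    in_losses:  losses from reference models trained WITH the example.
    out_losses: losses from reference models trained WITHOUT it.
    Higher scores indicate the example is more likely a training member.
    """
    in_dist = NormalDist(mean(in_losses), stdev(in_losses))
    out_dist = NormalDist(mean(out_losses), stdev(out_losses))
    return log(in_dist.pdf(target_loss)) - log(out_dist.pdf(target_loss))

# Toy illustration: members typically incur lower loss under the target model.
in_losses = [0.8, 0.9, 1.0, 1.1]    # reference models that saw the example
out_losses = [1.9, 2.0, 2.1, 2.2]   # reference models that did not

member_score = lira_score(0.95, in_losses, out_losses)      # near the IN mean
non_member_score = lira_score(2.05, in_losses, out_losses)  # near the OUT mean
```

Sweeping a threshold over such scores across many candidates yields the ROC curve from which the AUC figures quoted in the abstract (e.g., AUC<0.7) are computed.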