Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models

Abstract

State-of-the-art membership inference attacks (MIAs) typically require training many reference models, which makes them difficult to scale to large pre-trained language models (LLMs). As a result, prior research has relied either on weaker attacks that avoid training reference models (e.g., fine-tuning attacks) or on stronger attacks applied to small-scale models and datasets. However, weaker attacks have been shown to be brittle, achieving close-to-arbitrary success, and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA, one of the strongest MIAs, to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in three key ways: (1) strong MIAs can succeed on pre-trained LLMs; (2) their effectiveness, however, remains limited (e.g., AUC < 0.7) in practical settings; and (3) the relationship between MIA success and related privacy metrics is not as straightforward as prior work has suggested.
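To illustrate the kind of reference-model attack the abstract refers to, the following is a minimal sketch of a LiRA-style likelihood-ratio score and the AUC used to summarize attack success. It is not the authors' implementation. The function names (`lira_scores`, `roc_auc`), the array shapes, and the synthetic Gaussian statistics are all assumptions made for illustration. In a real attack the per-example statistic (for instance, a negative loss) would come from the target model and from reference models trained with and without each example.

```python
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Elementwise log-density of a normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2


def lira_scores(target_stat, in_stats, out_stats, eps=1e-6):
    """Per-example log-likelihood ratio (hypothetical helper, LiRA-style).

    target_stat: shape (n_examples,), statistic measured on the target model
    in_stats:    shape (n_in_refs, n_examples), reference models trained WITH the example
    out_stats:   shape (n_out_refs, n_examples), reference models trained WITHOUT it
    """
    # Fit one Gaussian per example to the IN and the OUT reference statistics.
    mu_in, sd_in = in_stats.mean(axis=0), in_stats.std(axis=0) + eps
    mu_out, sd_out = out_stats.mean(axis=0), out_stats.std(axis=0) + eps
    # Higher score means the observation looks more like a training member.
    return gaussian_logpdf(target_stat, mu_in, sd_in) - gaussian_logpdf(
        target_stat, mu_out, sd_out
    )


def roc_auc(scores, is_member):
    """AUC from the Mann-Whitney rank statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(is_member.sum())
    n_neg = len(scores) - n_pos
    return (ranks[is_member].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Synthetic stand-in data (assumption): members score slightly higher on average.
rng = np.random.default_rng(0)
n_examples, n_refs, gap = 2000, 16, 0.5
is_member = rng.random(n_examples) < 0.5
out_stats = rng.normal(0.0, 1.0, size=(n_refs, n_examples))
in_stats = rng.normal(gap, 1.0, size=(n_refs, n_examples))
target_stat = rng.normal(np.where(is_member, gap, 0.0), 1.0)

scores = lira_scores(target_stat, in_stats, out_stats)
auc = roc_auc(scores, is_member)
print(f"AUC on synthetic data: {auc:.3f}")
```

The sketch makes visible why such attacks are costly at LLM scale: every row of `in_stats` and `out_stats` corresponds to a separately trained reference model.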
