The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

Abstract

Membership inference attacks serve as a useful tool for the fair use of language models, for example in detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to a model's hidden states or probability distributions, which prevents investigation of more widely used, API-access-only models such as GPT-4. In this work, we introduce the N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize, and subsequently generate, text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, the N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs to the ground-truth suffix; high similarity indicates likely membership. We first demonstrate on a diverse set of existing benchmarks that the N-Gram Coverage Attack outperforms other black-box methods while achieving performance comparable to, or even better than, state-of-the-art white-box attacks, despite having access only to text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget: as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models across multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
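The scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the overlap metric, the tokenizer, or the aggregation function, so the sketch assumes whitespace tokenization, a coverage metric defined as the fraction of the suffix's distinct n-grams that reappear in a generation, and `max` as the default aggregator. The function names `ngrams`, `coverage`, and `membership_score` are hypothetical, and the model generations are passed in as plain strings rather than sampled from any real API.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coverage(generation: str, suffix: str, n: int = 3) -> float:
    """Fraction of the suffix's distinct n-grams that also occur in the generation.

    Assumption: whitespace tokenization; the paper may use a different
    tokenizer and a different overlap metric.
    """
    suffix_grams = ngrams(suffix.split(), n)
    if not suffix_grams:
        return 0.0  # suffix shorter than n tokens: nothing to cover
    gen_grams = ngrams(generation.split(), n)
    return len(suffix_grams & gen_grams) / len(suffix_grams)


def membership_score(
    generations: List[str],
    suffix: str,
    n: int = 3,
    aggregate: Callable[[Iterable[float]], float] = max,
) -> float:
    """Aggregate per-generation coverage into one score for the candidate.

    Assumption: max is used as the aggregator; a mean or other statistic
    could be swapped in through the `aggregate` argument.
    """
    if not generations:
        return 0.0
    return aggregate([coverage(g, suffix, n) for g in generations])


# Toy inputs: in a real attack, `generations` would be sampled from the
# target model conditioned on the candidate's prefix.
suffix = "the quick brown fox jumps over the lazy dog"
generations = ["the quick brown fox sleeps", "a cat sat on the mat"]
score = membership_score(generations, suffix)
```

A higher score would be read as stronger evidence of membership, typically by comparing it with a threshold calibrated on known members and non-members. Passing more generations can only raise a max-aggregated score or leave it unchanged, which is one simple way to see why a larger sampling budget might help.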
