Learning to Detect Language Model Training Data via Active Reconstruction

Abstract

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induce a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-member data, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we use on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To apply RL effectively to MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
