ChatPaper.ai


Learning to Detect Language Model Training Data via Active Reconstruction

February 22, 2026
Authors: Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi
cs.AI

Abstract

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induce a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-member data, and that this difference in reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in the weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To use RL effectively for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
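The core idea above — score each candidate by how reconstructible it becomes after RL finetuning, contrasted against the rest of the candidate pool — can be illustrated with a minimal sketch. This is an illustrative assumption of how such scoring might look, not the paper's actual implementation: the similarity metric (`difflib` ratio), the function names, and the mean-centered contrastive score are all placeholders for the reconstruction metrics and contrastive rewards the abstract refers to.

```python
# Hypothetical sketch of reconstruction-based membership scoring.
# All names and the specific metric here are illustrative assumptions,
# not the ADRA algorithm as published.
from difflib import SequenceMatcher


def reconstruction_score(generated: str, reference: str) -> float:
    """Similarity between a model's reconstruction and the candidate text
    (stand-in for the paper's reconstruction metrics)."""
    return SequenceMatcher(None, generated, reference).ratio()


def contrastive_membership_scores(
    recon_after: dict[str, float], recon_before: dict[str, float]
) -> dict[str, float]:
    """Score each candidate by how much its reconstructibility improved
    after training, relative to the pool average. Under the paper's
    hypothesis, members should improve more than non-members."""
    gains = {k: recon_after[k] - recon_before[k] for k in recon_after}
    pool_mean = sum(gains.values()) / len(gains)
    return {k: g - pool_mean for k, g in gains.items()}
```

A candidate whose reconstruction improves well above the pool average would be flagged as a likely training-set member; thresholding these contrastive scores yields the detection decision.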