

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

October 10, 2025
Authors: Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
cs.AI

Abstract

Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
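The abstract does not spell out how Self-Critique is implemented, but the signal it builds on is concrete: after RL post-training, contaminated samples tend to elicit collapsed, near-deterministic output distributions, i.e., low per-token entropy. Below is a minimal sketch of an entropy-based membership score in that spirit. All names (`contamination_score`, the `threshold` value) and the toy distributions are illustrative assumptions, not the paper's actual method.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_output_entropy(step_distributions):
    """Average per-step entropy over one generated sequence."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)

def contamination_score(step_distributions, threshold=0.5):
    """Hypothetical membership signal: RL-phase contamination is
    expected to show a collapsed (sparse, low-entropy) output
    distribution, so a low mean entropy flags a suspect sample.
    Returns (mean_entropy, flagged)."""
    h = mean_output_entropy(step_distributions)
    return h, h < threshold

# Toy illustration: a collapsed policy vs. a diffuse one
# (each inner list is one step's next-token probabilities).
collapsed = [[0.98, 0.01, 0.01]] * 4   # near-deterministic steps
diffuse   = [[0.40, 0.30, 0.30]] * 4   # higher-entropy steps

h_c, flag_c = contamination_score(collapsed)  # low entropy -> flagged
h_d, flag_d = contamination_score(diffuse)    # high entropy -> not flagged
```

In practice the distributions would come from the model's logits on benchmark samples (e.g., softmaxed scores at each generation step), and the threshold would be calibrated against known-clean data; the AUC numbers reported in the paper compare ranking quality of such scores rather than any fixed cutoff.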